Once you are comfortable working with a single DataFrame, the next step is combining multiple tables together. Real data almost never lives in one place. Customer information sits in one table, order history in another, product details in a third. Knowing how to combine these correctly, and how to reshape the result, is what makes analysis practical.
The Four Join Types
Before writing any code, it helps to have a clear picture of what each join type does. Two tables going into a merge will come out differently depending on which rows are kept when the key does not match on both sides.
An inner join keeps only the rows where the key exists in both tables. A left join keeps every row from the left table and fills in nulls where there is no match on the right. A right join does the same but in favour of the right table. An outer join is the most generous: it keeps every row from both tables and fills nulls wherever a match was missing on either side.
Basic Merge
result = orders.merge(customers, on='customer_id')
A merge stitches two tables together based on a shared column. For every row in the left table, pandas finds the row in the right table with the same key value and glues them side by side. The result is a wider table containing all the columns from both. By default the join is inner, so rows where the key exists in only one table are dropped. The original tables are not modified.
When the shared column has the same name in both tables, on= is all you need. When the column has different names on each side, use left_on and right_on instead:
products.merge(pricing, left_on='id', right_on='product_id')
Both columns end up in the result, which is usually redundant. You will often drop one of them afterwards to keep things tidy.
When both tables contain a column with the same name that is not the merge key, pandas cannot keep both under the same name in the result. The suffixes argument handles this:
drivers.merge(vehicles, on='vehicle_id', suffixes=('_driver', '_vehicle'))
Without specifying suffixes, pandas falls back to _x and _y, which tell you nothing about where each column came from. Setting meaningful suffixes at merge time saves confusion downstream.
Choosing the Right Join Type
# Keep all rows from the left tableproducts.merge(inventory, on='product_id', how='left')# Keep all rows from the right tableactive_users.merge(new_users, on='user_id', how='right')# Keep all rows from both tablescohort_a.merge(cohort_b, on='id', how='outer', suffixes=('_a', '_b'))
The choice of how= depends on what question you are answering. If you want every product in your catalogue along with its inventory level where one exists, that is a left join. If you want the union of two customer lists with no one excluded, that is an outer join. If you only want records that appear in both, inner is the right choice.
A useful pattern after a left or outer join is finding rows that had no match on the other side. After the join, those rows have null values in the columns that came from the unmatched table:
unmatched_only = merged_df[merged_df['plan_type_right'].isnull()]
This is the pandas equivalent of an anti-join: rows that exist in one table but not the other.
Chaining Multiple Merges
Because each .merge() returns a new DataFrame, you can chain them directly:
result = accounts.merge(region_data, on='postcode') \ .merge(districts, on='district_id')
The backslash continues the expression onto the next line for readability. This reads naturally: merge accounts with regional data on postcode, then take that result and merge it with district information on district ID. This pattern comes up constantly when you have one central table and several lookup tables you need to attach to it one at a time.
Self Joins
A self join merges a table with itself. This sounds unusual but is a standard tool for comparing rows within the same table:
staff_pairs = staff.merge(staff, on='project_id', how='inner', suffixes=('_lead', '_member'))lead_and_member = ( (staff_pairs['role_lead'] == 'Lead') & (staff_pairs['role_member'] != 'Lead'))result = staff_pairs[lead_and_member]
Merging the staff table with itself on project_id produces every possible pairing of two people working on the same project. The suffixes distinguish the two copies of every column. Filtering then keeps only the pairs where one person is a lead and the other is not, which gives you each lead paired with each of their project collaborators. Self joins are the standard approach for hierarchies, within-group comparisons, and any data with graph-like relationships.
Common Operations After Merging
After combining tables, the typical next steps are aggregating and sorting:
merged_df['plan_type'].value_counts()merged_df.groupby('department').agg({'account_id': 'count'})merged_df.sort_values('account_id', ascending=False)merged_df.groupby(['district', 'population', 'vacancy_rate'], as_index=False) \ .agg({'account_id': 'count'}) \ .sort_values(['vacancy_rate', 'account_id', 'population'], ascending=[False, True, True])
value_counts gives a quick frequency ranking of any column. groupby with agg counts rows per group. as_index=False keeps the group columns as regular columns rather than pushing them into the index, which makes subsequent sorting straightforward. The chained version at the bottom is the full analysis pipeline: group by three columns, count rows per group, then sort by three columns with mixed directions.
Stacking Tables with pd.concat()
Concatenation is different from merging. Where a merge glues tables side by side based on matching keys, pd.concat stacks them on top of each other:
pd.concat([jan_data, feb_data, mar_data])pd.concat([jan_data, feb_data, mar_data], join='inner', sort=True)pd.concat([jan_data, feb_data, mar_data], keys=['January', 'February', 'March'])
Columns line up by name across all the input tables. If they all have identical columns, the result is one taller table with the same structure. If they differ, the default behaviour fills missing columns with null values. Passing join='inner' restricts the result to only columns that appear in every input table.
The keys argument adds a label to each chunk of rows, which becomes the outer level of a multi-level index. When you have separate files for each month and want to remember which row came from which month after combining them, this is the cleanest way to do it. Once the labels are in place, you can group by them directly:
combined.groupby(level=0).agg({'revenue': 'mean'})
level=0 refers to the outer index level, which holds the keys you supplied. This gives you one row per input table in the result.
Time-Series Merges with merge_ordered()
For time-series data where order matters and gaps need to be filled, merge_ordered is a better fit than a standard merge:
pd.merge_ordered(revenue_df, benchmark_df, left_on='report_date', right_on='period', how='left', fill_method='ffill')
merge_ordered is a module-level function rather than a DataFrame method, and it keeps the result sorted by the merge key. The fill_method='ffill' argument is what makes it particularly useful for time series. When there is a null in a column after the merge, forward fill replaces it with the most recent non-null value above it. This is the standard approach when two datasets report at different frequencies: a monthly metric merged with a daily one will have many null rows on non-monthly dates, and forward fill carries the most recent monthly value forward until the next one arrives.
When merging on multiple columns, the column order controls the sort order, which in turn controls how forward fill propagates:
pd.merge_ordered(revenue_df, headcount_df, on=['region', 'report_date'], fill_method='ffill')
Listing region before report date sorts by region first, then by date within each region. Forward fill therefore carries values forward within each region’s own time series and never bleeds a value from one region into another.
Approximate Matching with merge_asof()
When timestamps do not align exactly between two sources, merge_asof finds the nearest match rather than requiring an exact one:
pd.merge_asof(exchange_a, exchange_b, on='timestamp', suffixes=('_a', '_b'), direction='nearest')
For each row in the left table, merge_asof looks for the closest row in the right table by timestamp. The direction argument controls which side to look: backward finds the most recent right-side row at or before the left-side timestamp, forward finds the next one at or after, and nearest picks whichever is closer in either direction. Both inputs must be sorted by the merge key before calling this function. This is the standard tool for aligning data from sources that tick at similar but non-identical times, such as trades from two different exchanges.
Filtering with .query()
After merging, filtering the result is often the next step. .query() offers a readable alternative to bracket-based filtering:
df.query('report_date >= "2023-01-01"')df.query('region == "EMEA"')
Compare df[df['report_date'] >= '2023-01-01'] with df.query('report_date >= "2023-01-01"'). The query version is cleaner, especially with multiple conditions, because you do not need to repeat the DataFrame name or wrap each condition in parentheses. Column names appear bare inside the string. To reference a Python variable from outside the string, prefix it with @:
target_region = "EMEA"df.query('region == @target_region')
Reshaping with .melt()
Sometimes data arrives in wide format, where each column represents a category rather than a variable. A table with one column per month is a common example. Most analysis tools prefer long format, where each row is one observation.
metrics_tall = metrics_wide.melt( id_vars = ['store_id'], var_name = 'channel', value_name = 'conversion_rate')
id_vars specifies which columns should stay put as row identifiers. The remaining columns, which in this case are the channel names, get unpivoted: each one becomes a row, with its name going into the column specified by var_name and its value going into the column specified by value_name. The result has more rows and fewer columns. This long format is what plotting libraries and further groupby operations expect, where every row represents one observation and every column represents one variable.
Reshaping with pivot_table()
pivot_table goes in the opposite direction, from long to wide:
revenue_pivot = revenue_long.pivot_table( 'revenue_per_user', 'period', 'region')
The first argument is the values column that fills the cells. The second is the column whose unique values become the row labels. The third is the column whose unique values become the column headers. The result is a grid where each cell holds the aggregation, mean by default, for that row and column combination. This format is useful for displaying results and feeding into spreadsheet tools, though it is usually less practical for further programmatic analysis than the long format it came from.
Quick Reference
| Task | Code |
|---|---|
| Inner join | a.merge(b, on='col') |
| Left join | a.merge(b, on='col', how='left') |
| Right join | a.merge(b, on='col', how='right') |
| Outer join | a.merge(b, on='col', how='outer') |
| Different column names | a.merge(b, left_on='x', right_on='y') |
| Rename duplicate columns | a.merge(b, on='col', suffixes=('_a', '_b')) |
| Chain three merges | a.merge(b, on='k1').merge(c, on='k2') |
| Self join | df.merge(df, on='id', suffixes=('_1', '_2')) |
| Find unmatched rows | df[df['col'].isnull()] |
| Stack tables | pd.concat([t1, t2, t3]) |
| Stack, shared columns only | pd.concat([...], join='inner') |
| Stack with labels | pd.concat([...], keys=['a', 'b', 'c']) |
| Time-series merge with fill | pd.merge_ordered(a, b, fill_method='ffill') |
| Approximate timestamp match | pd.merge_asof(a, b, on='ts', direction='nearest') |
| Filter rows | df.query('col >= value') |
| Wide to long | df.melt(id_vars=['x'], var_name='col', value_name='val') |
| Long to wide | df.pivot_table('val', 'row_col', 'col_col') |
When to Use Each Tool
Combining tables that share a column? → .merge()Stacking tables on top of each other? → pd.concat()Time-series data where rows need to stay sorted and gaps need filling? → pd.merge_ordered() with fill_method='ffill'Timestamps that almost but do not exactly match? → pd.merge_asof()Filtering a combined result cleanly? → .query()Columns that are actually values and need to become rows? → .melt()Rows that need to be summarised into a grid? → .pivot_table()
See you soon
[…] DataFrame has an index that labels its rows. Setting a column as the index promotes it out of the data and […]
[…] Reshaping data: https://datalad.co.uk/combining-reshaping-dataframes-pandas/ […]