Python Data Structures and Collections

Half of clean Python is picking the right container. Learn lists, tuples, dicts, and sets, plus Counter, defaultdict, namedtuple, and dataclass, and the idioms that make them effortless.

Half of writing clean Python is picking the right container for the job. The language gives you at least eight ways to hold a collection of values: list, tuple, dict, and set are built in, and Counter, defaultdict, namedtuple, and dataclass come from the standard library. Each has a personality. Lists are flexible workhorses, tuples are frozen records, dicts are lookup tables, sets are deduplicators, and the specialised types handle common patterns so you do not have to. The other half is the small idioms (zip, enumerate, comprehensions, .get, .items) that make working with these containers feel effortless. This guide walks through all of them and, more importantly, when to reach for each.

Lists: the mutable workhorse

Lists are ordered, mutable, indexable, and can hold mixed types, which makes them the default container. The two methods most worth contrasting are append and extend, because confusing them is one of the most common beginner bugs.

names = ['Ximena', 'Aliza', 'Ayden', 'Calvin']
names.extend(['Rowen', 'Sandeep'])  # adds two items, one per element
position = names.index('Aliza')     # index of the first match
names.pop(position)                 # removes 'Aliza' and returns it

append adds exactly one item, so appending a list nests it, while extend adds every item from an iterable, flattening it by one level. Whenever the thing you are adding is itself a list and you want its elements merged in, reach for extend. Removing has three flavours that answer different questions. pop(i) removes by index and gives you the value back, which is useful when you want to process what you removed. remove(x) removes the first matching value and returns nothing. And del lst[i] deletes a slot without returning anything and is marginally faster. All three modify the list in place.
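The difference is easiest to see side by side. Here is a minimal sketch, using made-up names rather than data from a real project, that shows append nesting a list, extend merging it, and what each of the three removal tools hands back.

```python
# Hypothetical sample data for illustration.
nested = ['Ximena', 'Aliza']
nested.append(['Rowen', 'Sandeep'])    # one new item: the list itself
# nested is now ['Ximena', 'Aliza', ['Rowen', 'Sandeep']]

flat = ['Ximena', 'Aliza']
flat.extend(['Rowen', 'Sandeep'])      # two new items, merged in
# flat is now ['Ximena', 'Aliza', 'Rowen', 'Sandeep']

removed = flat.pop(1)                  # removes by index, returns 'Aliza'
result = flat.remove('Rowen')          # removes first match, returns None
del flat[0]                            # deletes the slot, returns nothing
# flat is now ['Sandeep']
```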

When you need the items in order, know the difference between sorted(lst), which returns a new sorted list and leaves the original untouched, and lst.sort(), which mutates in place and returns nothing. Use sorted when you need both versions and .sort() when you are done with the original. And whenever you find yourself writing the build-up pattern of an empty list, a loop, and an append, that is almost always a list comprehension waiting to happen.

chosen = [row[3] for row in records]

Read left to right, this says “for each row in records, take the element at index 3 (the fourth item, since indexing starts at zero), collected into a list.” It replaces three lines with one, and it usually runs a little faster because comprehensions are optimised inside the interpreter.
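To make the equivalence concrete, here is a small sketch with invented rows (the column layout is an assumption for the example). It builds the same list with the loop and with the comprehension, then contrasts sorted with .sort().

```python
# Hypothetical rows; index 3 is assumed to hold a body mass in grams.
records = [
    ('Adelie', 'Torgersen', 'FEMALE', 3400),
    ('Gentoo', 'Biscoe', 'MALE', 5200),
    ('Chinstrap', 'Dream', 'FEMALE', 3650),
]

# The build-up pattern: empty list, loop, append.
chosen_loop = []
for row in records:
    chosen_loop.append(row[3])

# The equivalent comprehension.
chosen = [row[3] for row in records]

ordered = sorted(chosen)   # new sorted list; chosen keeps its order for now
outcome = chosen.sort()    # sorts chosen in place and returns None
```

Assigning the result of .sort() to a variable, as outcome does here on purpose, is a classic slip: the variable ends up holding None.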

Tuples and the idioms that pair with them

Tuples are immutable ordered sequences, the right choice for fixed-shape records that should not change. They come with Python’s quirkiest gotcha: commas, not parentheses, make tuples.

normal = 'simple' # a string
oops = 'trailing comma', # a one-element tuple

That trailing comma silently turns a value into a one-element tuple even without parentheses, which can happen by accident. The rule to memorise is that () is the empty tuple, (x,) is a one-tuple, and (x, y) is a two-tuple. By the same rule, (5) is just the number five in parentheses, while (5,) is a tuple.
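A quick way to convince yourself of the rule is to check the types directly. The variable names below are just for illustration.

```python
empty = ()          # the empty tuple
one = (5,)          # a one-element tuple
two = (5, 6)        # a two-element tuple
just_five = (5)     # only the integer 5 in parentheses
accidental = 5,     # the trailing comma makes this a tuple too

kinds = [type(v).__name__ for v in (empty, one, two, just_five, accidental)]
# kinds is ['tuple', 'tuple', 'tuple', 'int', 'tuple']
```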

Tuples shine alongside three iteration idioms. zip pairs elements from two or more iterables in lockstep like a zipper, stopping at the shortest. enumerate wraps an iterable to yield index-and-item pairs, solving the “I need to know which iteration I am on” problem. And unpacking assigns the pieces of a tuple to named variables in one move.

for rank, (first, last) in enumerate(zip(first_names, last_names)):
    print(f"Rank {rank}: {first} {last}")

Unpacking has a powerful extension in starred assignment, where a, *rest = [1, 2, 3, 4] puts the first element in a and everything else in rest as a list, the same “collect the rest” idea as *args in a function signature.
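Here is a short sketch of these idioms on made-up data: zip stopping at the shorter input, enumerate with its optional start argument so ranks begin at one, and starred unpacking at the front and in the middle.

```python
# Hypothetical sample data for illustration.
first_names = ['Ximena', 'Aliza', 'Ayden']
last_names = ['Ortiz', 'Khan']          # one shorter on purpose

pairs = list(zip(first_names, last_names))
# zip stops at the shortest input, so 'Ayden' is dropped

ranked = [
    f"{rank}: {first} {last}"
    for rank, (first, last) in enumerate(pairs, start=1)
]

a, *rest = [1, 2, 3, 4]                 # rest is always a list
head, *middle, tail = (10, 20, 30, 40)  # the star can sit in the middle
```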

Strings into output

Two string tools come up constantly when producing output. The f-string is the modern way to embed values, evaluating any expression inside its braces, so f"{total:.2f}" formats inline and f"{x=}" even prints the variable name and value for debugging. And str.join is the canonical way to combine a list of strings, called on the separator with the iterable as its argument.

joined = ", ".join(items)

The grammar feels backwards at first, but the separator is the constant and the list is the variable thing, and this design also lets you join with a newline or with nothing at all. It matters for performance too: joining builds the result once, while repeatedly adding strings in a loop rebuilds the whole string each time, so for more than a few items always join.
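A few lines show both tools together. The values are invented for the example, and the f"{x=}" form assumes Python 3.8 or later.

```python
# Hypothetical values for illustration.
total = 1234.5
x = 42
items = ['apples', 'pears', 'plums']

price_text = f"{total:.2f}"     # two decimal places
debug_text = f"{x=}"            # name and value, Python 3.8+

joined = ", ".join(items)       # separator between items
lines = "\n".join(items)        # one item per line
squashed = "".join(items)       # no separator at all
```

Note that join expects strings, so a list of numbers needs converting first, for example ", ".join(str(n) for n in numbers).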

Dicts: the data workhorse

Whenever your data is shaped like “X is associated with Y”, a customer with their orders, a city with its stores, a code with its record, you reach for a dict. Building one from pairs is a standard loop, and reading it back has a crucial safety choice.

stores_by_city = {}
for city, store in records:
    stores_by_city[city] = store

Direct indexing with d['key'] is fast but fragile, crashing with a KeyError when the key is absent, whereas .get('key') returns None for a missing key, or a default you supply as a second argument. Use indexing when you are certain the key exists and a missing one should be a loud error, and use .get with an explicit default for robust, readable code like name = user.get("name", "Anonymous"). Adding works either with single assignment or with .update() to merge another dict, and removing mirrors lists, with pop(key, default) when you want the value back safely and del d[key] when you do not.
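The sketch below walks through those choices on a small invented user dict: the loud failure of indexing, the quiet default of .get, and the add and remove operations.

```python
# Hypothetical record for illustration.
user = {"email": "ana@example.com"}

name = user.get("name", "Anonymous")   # default supplied
missing = user.get("name")             # no default, so None

try:
    user["name"]                       # key is absent
    raised = False
except KeyError:
    raised = True                      # indexing fails loudly

user["name"] = "Ana"                                     # single assignment
user.update({"plan": "free", "email": "ana@example.org"})  # merge, overwriting email

plan = user.pop("plan", None)          # removes and returns 'free'
absent = user.pop("nickname", "n/a")   # key missing, default returned
del user["email"]                      # removes without returning
```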

Iterating a dict yields its keys by default, which is why sorted(d) gives you the keys in order, but the workhorse for walking a dict is .items(), which yields key-value pairs to unpack on the fly.

for field, value in record.items():
    print(field, value)

This is cleaner and faster than looking each value up by key inside the loop. Membership testing with key in d is genuinely fast, running in constant time on average no matter how large the dict, because Python hashes the key and looks it up directly, but note it checks keys only, never values, and writing if k in d.keys() is redundant since plain if k in d does the same thing faster. Real data is often nested, dicts inside dicts inside lists, the shape of JSON from an API, and drilling in is just chaining brackets, with the caveat that every level must exist or the chain raises a KeyError, which is exactly when chained .get() calls with defaults earn their keep.
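Here is a minimal sketch of both points, using an invented nested structure of the kind an API might return. The field names are assumptions for the example. The trick with chained .get() is to default each intermediate level to an empty dict so the next .get() still has something to call.

```python
# Hypothetical API-style payload for illustration.
response = {"user": {"profile": {"city": "Lima"}}}

# Chained brackets work when every level exists.
city = response["user"]["profile"]["city"]

# Chained .get() survives a missing level or a missing leaf.
country = (
    response.get("user", {})
    .get("profile", {})
    .get("country", "unknown")
)

has_key = "user" in response                  # checks keys
has_value = "Lima" in response                # False: values are not searched
in_values = "Lima" in response["user"]["profile"].values()  # explicit value scan
```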

Sets: uniqueness and set arithmetic

Sets are unordered collections of unique items, and they bring two superpowers. The first is deduplication, since converting a list to a set collapses repeats. The second is set arithmetic borrowed from mathematics: union for everything in either set, intersection for what is in both, difference for what is in one but not the other, and symmetric difference for what is in exactly one.

female_species = set(female_species_list)
female_only = female_species.difference(male_species)

Each method has a terse operator equivalent, | for union, & for intersection, - for difference, and ^ for symmetric difference, and the operators chain neatly, so a | b | c unions three sets. Sets also use a hash table internally, so membership testing is constant time, far faster than scanning a list, which makes a set the right container whenever you will check membership repeatedly. Removing has the now-familiar pairing, with discard(x) silently doing nothing if the element is absent and remove(x) raising if it is.
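The operators are easiest to remember once you have seen them on small sets. This sketch uses arbitrary integers and also shows the discard and remove pairing.

```python
a = {1, 2, 3, 4}
b = {3, 4, 5}
c = {5, 6}

union = a | b                  # in either set
both = a & b                   # in both sets
only_a = a - b                 # in a but not in b
exactly_one = a ^ b            # in one set but not both
all_three = a | b | c          # operators chain

unique = set([3, 1, 3, 2, 1])  # duplicates collapse

a.discard(99)                  # absent element: silently does nothing
try:
    a.remove(99)               # absent element: raises KeyError
    raised = False
except KeyError:
    raised = True
```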

The collections module

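The standard library adds four specialised containers that turn common patterns into one-liners. Counter, defaultdict, and namedtuple live in the collections module, while dataclass comes from the separate dataclasses module.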

Counter answers “how many of each?” Hand it any iterable and get back a dict-like mapping of items to counts, replacing the manual counting loop entirely.

from collections import Counter
sex_counts = Counter([row['sex'] for row in records])
print(sex_counts.most_common(3))

Its .most_common(n) returns the top items as sorted pairs, and indexing a missing key returns zero rather than raising, which spares you defensive code.
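A small self-contained sketch, with invented values, shows the counting, the sorted pairs, and the zero for a missing key. It also shows that reading a missing key does not add it to the Counter.

```python
from collections import Counter

# Hypothetical sample values for illustration.
sexes = ['FEMALE', 'MALE', 'FEMALE', 'FEMALE', 'MALE']

counts = Counter(sexes)
top = counts.most_common(1)            # list of (item, count) pairs
unknown = counts['UNKNOWN']            # missing key gives 0, no KeyError
still_absent = 'UNKNOWN' in counts     # reading did not create the key
```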

defaultdict is a dict that auto-creates a default value for any missing key, which removes the boilerplate guard you would otherwise write when grouping items.

from collections import defaultdict
weights_by_species = defaultdict(list)
for species, mass in records:
    weights_by_species[species].append(mass)

Without it, you would have to check whether each key exists and initialise an empty list before the first append. The factory can be list for grouping, int for counting since missing keys default to zero, or even a lambda for a custom default. The key detail is that it takes a callable, not a value, which is why list and int work directly and a custom string needs wrapping in a lambda.
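For comparison, here is a sketch of the guard that defaultdict saves you from, followed by the int and lambda factories mentioned above. The records are invented for the example.

```python
from collections import defaultdict

# Hypothetical (species, mass) pairs for illustration.
records = [('Adelie', 3400), ('Gentoo', 5200), ('Adelie', 3700)]

# The manual version: check, initialise, then append.
manual = {}
for species, mass in records:
    if species not in manual:
        manual[species] = []
    manual[species].append(mass)

# int as the factory: missing keys start at 0, so += just works.
tally = defaultdict(int)
for species, _ in records:
    tally[species] += 1

# A custom default must be wrapped in a callable.
labels = defaultdict(lambda: 'unknown')
labels['Adelie'] = 'small'
fallback = labels['Emperor']   # returns 'unknown' and stores the key
```

One thing to keep in mind is that simply reading a missing key creates it, which is why 'Emperor' ends up in labels. Use .get() or the in operator when you want to look without inserting.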

namedtuple fixes the readability problem of plain tuples, where you have to remember whether the species was at index zero or two. You define a tuple class with named fields once, then access them by name.

from collections import namedtuple
Record = namedtuple('Record', ['species', 'sex', 'body_mass'])
entry = Record('Adelie', 'FEMALE', 3400)
print(entry.species) # clearer than entry[0]

It keeps all the speed and immutability of a tuple while being self-documenting and refactoring-safe, since reordering fields will not silently break attribute access the way it would break numeric indexing.
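The sketch below reuses the same Record definition to show that a namedtuple still behaves like a tuple: it unpacks, it refuses assignment, and a changed copy comes from _replace. The _replace and _asdict helpers are part of the namedtuple API.

```python
from collections import namedtuple

Record = namedtuple('Record', ['species', 'sex', 'body_mass'])
entry = Record('Adelie', 'FEMALE', 3400)

species, sex, body_mass = entry        # unpacks like any tuple

try:
    entry.body_mass = 3500             # fields cannot be reassigned
    immutable = False
except AttributeError:
    immutable = True

heavier = entry._replace(body_mass=3500)   # new record, original untouched
as_dict = entry._asdict()                  # field names mapped to values
```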

dataclass is namedtuple’s grown-up cousin, a class for holding related fields that is mutable by default, supports methods and properties, takes default values easily, and uses type hints.

from dataclasses import dataclass

@dataclass
class WeightEntry:
    species: str
    flipper_length: int
    body_mass: int

    @property
    def mass_to_flipper_ratio(self):
        return self.body_mass / self.flipper_length

The decorator auto-generates the constructor, a readable representation, and field-based equality from those few declarations, and the @property turns a method into attribute-style access, so entry.mass_to_flipper_ratio is computed on demand but looks like a stored field. Modern Python projects increasingly prefer dataclasses for their type hints and method support, reserving namedtuple for when immutability and minimal memory genuinely matter.
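To see the generated pieces in action, here is a self-contained sketch that repeats the class and then exercises the constructor, the representation, equality, and the property. The measurements are made up for the example.

```python
from dataclasses import dataclass

@dataclass
class WeightEntry:
    species: str
    flipper_length: int
    body_mass: int

    @property
    def mass_to_flipper_ratio(self):
        return self.body_mass / self.flipper_length

# Hypothetical measurements for illustration.
entry = WeightEntry('Adelie', 190, 3800)
twin = WeightEntry('Adelie', 190, 3800)

text = repr(entry)                        # generated __repr__
same = entry == twin                      # generated field-based __eq__
ratio = entry.mass_to_flipper_ratio       # no parentheses needed

entry.body_mass = 3990                    # mutable by default
new_ratio = entry.mass_to_flipper_ratio   # recomputed on access
```

If you want the record to reject changes the way a namedtuple does, the decorator accepts frozen=True.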

A few language essentials

Three smaller features come up constantly with these containers. Python has a strong notion of truthiness, where empty containers, zero, the empty string, and None are all falsey and almost everything else is truthy, which is why the Pythonic check is if my_list: rather than if len(my_list) > 0:. There are two equality questions: == asks whether two things have the same value, while is asks whether they are the same object in memory. Reserve is for the canonical is None check and for the rare case where you truly need the literal boolean True rather than just a truthy value, since a truthy string is not the same object as True. And there are two division operators: / for true division, which always returns a float, and // for floor division, which rounds the result down to a whole number. The surprise is that rounding down means rounding toward negative infinity, not toward zero, so -5 // 2 is minus three, not minus two.
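All three points fit in a few lines. The values below are arbitrary, and int() is included only to contrast truncation toward zero with floor division.

```python
# Truthiness: empty containers, zero, '' and None are falsey.
falsey = [bool(v) for v in ([], {}, set(), 0, '', None)]
truthy = [bool(v) for v in ([0], 'no', -1)]

# Equality versus identity.
a = [1, 2]
b = [1, 2]
same_value = a == b        # equal contents
same_object = a is b       # two separate list objects
value = None
is_none = value is None    # the canonical None check

# The two division operators.
true_div = 4 / 2           # a float even when it divides evenly
floor_pos = 7 // 2
floor_neg = -5 // 2        # rounds toward negative infinity
truncated = int(-5 / 2)    # int() truncates toward zero instead
```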

Choosing the right container

It all comes down to a short set of questions. If you need order, mutability, and duplicates, use a list. If you need order but immutability, a tuple. If you need unique elements and do not care about order, a set. For key-to-value lookup, a dict. To count occurrences, a Counter. For a dict whose missing keys should auto-initialise, a defaultdict. For a lightweight record with named fields, a namedtuple. And for a record that needs methods or mutable fields, a dataclass. Get that choice right and the small idioms, comprehensions, zip, enumerate, .get, .items, do the rest, and a large share of your everyday Python stops being a struggle and starts being a pleasure.

See you soon.
