How to Process Data Too Big for Memory with Python Generators

Some Python features you can ignore for years and never miss. Lazy evaluation is not one of them. The moment a dataset stops fitting in memory, or a loop starts feeling clunky, you end up reaching for exactly the tools in this article: iterators, comprehensions, and generators. They share one underlying idea, producing values on demand instead of all at once, and once that idea clicks, a whole family of Python syntax suddenly makes sense.

Iterables vs. Iterators: A Distinction That Sounds Pedantic but Isn’t

An iterable is anything you can walk through: a list, a string, a dict, a range. Think of it as a deck of cards sitting on the table. An iterator is the thing that walks through it, like your finger moving through the deck, remembering exactly where you are. The distinction sounds academic until you realize the entire for loop machinery is built on it.

			
pioneers = ["ada", "grace", "katherine"]

# The for loop hides the iterator machinery
for name in pioneers:
    print(name)

# The same walk, done by hand
it = iter(pioneers)
print(next(it))   # "ada"
print(next(it))   # "grace"
print(next(it))   # "katherine"

		

When you write for name in pioneers, Python silently does two things: it grabs an iterator from the list by calling iter(), then calls next() on it repeatedly until the values run out. Doing it manually makes the hidden machinery visible. One more next(it) after the last value raises a StopIteration error, because an exhausted iterator is done for good. The list itself is fine; you just need a fresh iterator to walk it again.
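To make those two hidden steps concrete, here is a rough hand-written equivalent of the for loop. It is a sketch of the idea, not what the interpreter literally executes, and the helper name manual_for is made up for illustration.

```python
def manual_for(iterable):
    """Collect items the way a for loop would, using iter() and next() by hand."""
    collected = []
    it = iter(iterable)          # step 1: ask the iterable for an iterator
    while True:
        try:
            item = next(it)      # step 2: pull the next value
        except StopIteration:    # raised once the iterator is exhausted
            break
        collected.append(item)
    return collected

pioneers = ["ada", "grace", "katherine"]
print(manual_for(pioneers))      # ['ada', 'grace', 'katherine']

it = iter(pioneers)
list(it)                         # drains the iterator
spent = list(it)
print(spent)                     # [] because the iterator is spent
print(list(iter(pioneers)))      # a fresh iterator walks the list again
```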

The payoff of this design shows up with range:

			
r = iter(range(10 ** 100))
print(next(r))   # 0
print(next(r))   # 1

range looks like it generates a list of numbers, but it never does. It is a tiny object that knows the rule “start, stop, step” and computes each value only when asked. range(10**100) does not allocate memory for a googol of numbers, which would crash any computer ever built; it just remembers the bounds and produces values on demand. This laziness is why range(10) and range(10**100) cost essentially the same to create, and it is the same trick that makes generators so powerful later in this article.
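A quick way to check this for yourself is to compare the two objects directly. The sizes below are what CPython reports for the range object itself; other interpreters may differ, so treat this as a sketch rather than a guarantee.

```python
import sys

small = range(10)
huge = range(10 ** 100)

# On CPython both objects report the same small size, because only
# references to start, stop and step are stored.
print(sys.getsizeof(small) == sys.getsizeof(huge))   # True

# Membership and indexing are computed arithmetically, not by walking values.
print(10 ** 99 in huge)    # True
print(huge[12345])         # 12345
```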

The Built-in Iterator Toolkit

Two built-ins handle the most common loop patterns so you never have to index manually.

enumerate solves “I need the item and its position” by wrapping any iterable so each step hands you an (index, item) tuple:

			
pioneers = ["ada", "grace", "katherine"]
list(enumerate(pioneers))           # [(0, 'ada'), (1, 'grace'), (2, 'katherine')]
list(enumerate(pioneers, start=1))  # [(1, 'ada'), (2, 'grace'), (3, 'katherine')]
for index, name in enumerate(pioneers):
    print(index, name)

		

The clean way to consume it is tuple unpacking right in the loop header, which splits each pair into two named variables. Compare that to the clunky alternative, for i in range(len(pioneers)) followed by an index lookup; enumerate is shorter, works on any iterable rather than just things with a length, and the start argument lets the numbering begin at 1 for human-facing output, since people count from one no matter what programmers prefer.
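The "any iterable" point is easy to verify with an iterator, which has no length at all. A small sketch using the same sample names:

```python
pioneers = iter(["ada", "grace", "katherine"])   # an iterator, not a list

try:
    len(pioneers)                                # range(len(...)) would fail here
except TypeError as exc:
    print("no len():", exc)

numbered = list(enumerate(pioneers, start=1))
print(numbered)   # [(1, 'ada'), (2, 'grace'), (3, 'katherine')]
```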

zip pairs up multiple sequences like a zipper closing a jacket, element 0 with element 0, element 1 with element 1, and so on:

			
names  = ["ada", "grace", "katherine"]
fields = ["mathematics", "computing", "aeronautics"]
years  = [1843, 1952, 1962]

list(zip(names, fields, years))
# [('ada', 'mathematics', 1843), ('grace', 'computing', 1952), ('katherine', 'aeronautics', 1962)]

for name, field, year in zip(names, fields, years):
    print(name, field, year)

# Unzip: transpose the pairs back into separate tuples
z = zip(names, fields)
unzipped_names, unzipped_fields = zip(*z)

		

Three input lists give you 3-tuples; five would give you 5-tuples. One behavior to know: zip stops at the shortest input and silently discards the extras from longer ones, so reach for itertools.zip_longest when you need padding instead. The last two lines show the unzip trick, and it is worth pausing on. The * operator spreads an iterable into separate arguments, so zip(*z) passes each tuple as its own argument back to zip, which transposes the data and recovers the original sequences (as tuples rather than lists).
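Here is a short sketch of both behaviors, with the years list deliberately cut short to trigger the truncation. The pairs list is made-up sample data.

```python
from itertools import zip_longest

names = ["ada", "grace", "katherine"]
years = [1843, 1952]                      # one entry short on purpose

truncated = list(zip(names, years))
print(truncated)
# [('ada', 1843), ('grace', 1952)]

padded = list(zip_longest(names, years, fillvalue=None))
print(padded)
# [('ada', 1843), ('grace', 1952), ('katherine', None)]

# The unzip trick: zip(*pairs) transposes rows back into columns.
pairs = [("ada", "mathematics"), ("grace", "computing")]
unzipped_names, unzipped_fields = zip(*pairs)
print(unzipped_names)    # ('ada', 'grace')
print(unzipped_fields)   # ('mathematics', 'computing')
```

Note that the unzipped results come back as tuples, so wrap them in list() if you need to modify them.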

These tools compose with the rest of Python because most built-ins accept any iterable, not just lists. sum(range(10, 21)) accumulates without ever building a list, and dict(zip(keys, vals)) builds a lookup table in one line by pairing keys with values and feeding the resulting tuples straight into dict.
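Both one-liners in runnable form, with keys and vals filled in from the earlier sample data as an illustration:

```python
total = sum(range(10, 21))        # 10 + 11 + ... + 20, no list is built
print(total)                      # 165

keys = ["ada", "grace", "katherine"]
vals = [1843, 1952, 1962]
lookup = dict(zip(keys, vals))    # pairs flow straight from zip into dict
print(lookup["grace"])            # 1952
```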

Comprehensions: Loops Distilled to Their Essence

A list comprehension is shorthand for “build a list by transforming each item of another iterable.” The pattern reads almost like English:

			
squares = [x ** 2 for x in range(10)]
long_names = [name for name in names if len(name) >= 7]
padded = [name if len(name) >= 7 else "" for name in names]

The first line replaces the verbose three-line dance of creating an empty list, looping, and appending. The second and third lines look similar but do completely different things, and the position of the if is the signal. An if at the end, after the for, is a filter: items failing the test are dropped entirely, so the output can be shorter than the input. An if-else at the beginning, before the for, is a conditional expression: every item produces exactly one output, the condition just chooses which form it takes.

			
[x for x in nums if x > 0]          # filter: keeps positives, drops the rest
[x if x > 0 else 0 for x in nums]   # transform: same length, negatives clamped to 0

Same letters, totally different meaning. Use the first when the goal is “keep some, drop others” and the second when the goal is “transform everything, but pick the transformation per item.” This is one of the easiest things in Python to get subtly wrong, precisely because the syntax looks so similar.
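Running both forms on the same input makes the difference hard to miss. The nums list here is made-up sample data.

```python
nums = [3, -1, 0, 7, -5]

kept = [x for x in nums if x > 0]             # filter: output can shrink
clamped = [x if x > 0 else 0 for x in nums]   # transform: output keeps its length

print(kept)      # [3, 7]
print(clamped)   # [3, 0, 0, 7, 0]

# The two can be combined: filter first, then choose a form per surviving item.
capped = [x if x < 5 else 5 for x in nums if x > 0]
print(capped)    # [3, 5]
```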

Dict comprehensions follow the same shape with curly braces and a key-value pair:

			
name_lengths = {name: len(name) for name in names}
# {'ada': 3, 'grace': 5, 'katherine': 9}

Perfect for building lookup tables from lists or transforming a dict’s values, and the same trailing if filter works here too.
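A sketch of both uses. The prices dict is hypothetical sample data, included only to show a value transformation.

```python
names = ["ada", "grace", "katherine"]

# Trailing if: keep only names with at least five letters.
long_name_lengths = {name: len(name) for name in names if len(name) >= 5}
print(long_name_lengths)   # {'grace': 5, 'katherine': 9}

# Transform the values of an existing dict by looping over its items.
prices = {"tea": 2.0, "coffee": 3.0}
doubled = {item: price * 2 for item, price in prices.items()}
print(doubled)             # {'tea': 4.0, 'coffee': 6.0}
```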

Comprehensions also nest, which is how you build a matrix in one line:

matrix = [[col for col in range(5)] for row in range(5)]

Read nested comprehensions inside-out: the inner part builds one row of five numbers, and the outer for repeats that construction five times, giving a list of five lists. A word of caution from experience: past two levels of nesting, comprehensions stop being clever and start being hostile. Switch back to a regular loop for anything deeper.
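For comparison, this is the regular-loop version of the same matrix, which is roughly what the nested comprehension expands to. The names matrix_loop and current_row are illustrative.

```python
matrix = [[col for col in range(5)] for row in range(5)]

matrix_loop = []
for row in range(5):             # outer for: one pass per row
    current_row = []
    for col in range(5):         # inner part: build one row of five numbers
        current_row.append(col)
    matrix_loop.append(current_row)

print(matrix_loop == matrix)     # True
print(matrix[0])                 # [0, 1, 2, 3, 4]
```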

Generators: The Same Logic, Without the Memory Bill

Here is the most consequential pair of brackets in Python:

			
eager = [name for name in names if len(name) >= 7]   # list comprehension
lazy  = (name for name in names if len(name) >= 7)   # generator expression

Square brackets are eager: the whole comprehension runs immediately and every result is stored in a list in memory. Round brackets are lazy: you get back a generator, which is essentially a recipe for producing the values, and nothing is computed until you ask. For ten items the difference is irrelevant. For ten million, the eager version eats your RAM while the lazy one sips along producing a single value at a time.

			
big = (num for num in range(1000000))
print(next(big))   # 0
print(next(big))   # 1

A million potential values, almost no memory, because only the current value exists at any moment. One caveat to internalize early: generators are one-shot. Once exhausted, they cannot be rewound. If you need to iterate twice, either materialize a list or create a fresh generator.
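Both claims, the small footprint and the one-shot behavior, can be checked in a few lines. The exact byte counts depend on the Python version, so this sketch only compares them.

```python
import sys

eager = [n * 2 for n in range(1_000_000)]   # every value stored up front
lazy = (n * 2 for n in range(1_000_000))    # only a recipe so far

print(sys.getsizeof(eager) > sys.getsizeof(lazy))   # True

first_total = sum(lazy)    # consumes the generator
second_total = sum(lazy)   # nothing left to consume
print(first_total)         # 999999000000
print(second_total)        # 0
```

The second sum returning 0 without any error is the trap: an exhausted generator looks like an empty one.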

When the logic gets too involved for a one-line expression, write a generator function. The keyword that changes everything is yield:

			
def get_lengths(items):
    for item in items:
        yield len(item)

for length in get_lengths(["ada", "grace", "katherine"]):
    print(length)   # 3, 5, 9

		

Where return computes a final value and exits, yield produces a value and pauses, preserving the function’s entire state, local variables and all. The next request resumes execution exactly where it stopped. Stranger still, calling the function does not run its body at all; it just hands back a generator object, and the body only executes as values get consumed. This is how you write functions that produce streams instead of collections.
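You can watch the pause-and-resume behavior by having the generator leave notes in a list. The countdown function and the events list are made up for this sketch.

```python
events = []

def countdown(start):
    events.append("body started")
    while start > 0:
        yield start              # hand back a value and pause here
        start -= 1               # execution resumes on this line
    events.append("body finished")

gen = countdown(2)
print(events)        # [] because calling the function ran nothing
print(next(gen))     # 2
print(events)        # ['body started']
print(next(gen))     # 1
print(list(gen))     # [] since the loop ends once start reaches 0
print(events)        # ['body started', 'body finished']
```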

The classic real-world payoff is reading a file of any size with constant memory:

			
def stream_lines(file_object):
    while True:
        data = file_object.readline()
        if not data:
            break
        yield data

with open("server_logs.csv") as f:
    lines = stream_lines(f)
    print(next(lines))   # first line
    print(next(lines))   # second line

		

The function reads one line, yields it, and pauses. The caller processes that line and asks for another; the function wakes, reads, yields, pauses again, until readline returns an empty string at the end of the file and the loop breaks. This works identically on a 1 KB file and a 100 GB file, because at any moment exactly one line lives in memory. The pattern generalizes to any stream of records: log processing, network packets, database cursors.
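To try the pattern without a real log file, an in-memory io.StringIO can stand in for the file object. The log lines are made up, and stream_lines is repeated so the sketch runs on its own.

```python
import io

def stream_lines(file_object):
    # Same generator as above.
    while True:
        data = file_object.readline()
        if not data:
            break
        yield data

fake_file = io.StringIO("GET /home\nGET /about\nPOST /login\n")

# Count matching lines while holding only one line at a time.
post_count = sum(1 for line in stream_lines(fake_file) if line.startswith("POST"))
print(post_count)   # 1

# File objects are themselves lazy iterators over lines, so this is equivalent.
fake_file.seek(0)
direct_count = sum(1 for line in fake_file if line.startswith("POST"))
print(direct_count)   # 1
```

In everyday code, looping over the file object directly is the shorter route; the hand-written generator is worth knowing because the same shape works for sources that are not files.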

Processing Files Bigger Than Your RAM

pandas has this laziness built in. Normally pd.read_csv loads the entire file into a DataFrame, which simply fails when the file outgrows your memory. The chunksize argument changes the contract:

			
import pandas as pd

# Pull a single chunk by hand
reader = pd.read_csv("transactions.csv", chunksize=1000)
print(next(reader))   # first 1000 rows as a DataFrame

# Or loop over every chunk in the file
for chunk in pd.read_csv("transactions.csv", chunksize=1000):
    print(chunk.shape)

		

With chunksize set, read_csv returns an iterator that yields 1000-row DataFrames on demand. Each chunk is a full-powered DataFrame, so everything you know about pandas applies; the file just streams through your script without ever fully landing in memory. This is the standard recipe for analyzing files in the tens-of-gigabytes range on an ordinary laptop.

The core streaming idiom puts an accumulator outside the loop and chunked processing inside. Here is a count of orders per country across a file too big to load:

			
counts = {}
for chunk in pd.read_csv("orders.csv", chunksize=10000):
    for country in chunk["country"]:
        if country in counts:
            counts[country] += 1
        else:
            counts[country] = 1

		

The outer loop walks the file chunk by chunk, the inner loop walks the country values within the current chunk, and the dictionary quietly accumulates the global tally. By the end, counts holds totals for the entire file even though no more than one chunk ever existed in memory. The increment-or-create logic cleans up nicely with collections.Counter, whose update method does it in one call per chunk.
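A minimal sketch of the Counter version. To keep it runnable without a CSV file, plain lists with made-up values stand in for the chunk["country"] column of each chunk; in the pandas loop you would pass chunk["country"] to update instead.

```python
from collections import Counter

# Stand-ins for chunk["country"] from three successive chunks.
chunks = [
    ["DE", "FR", "DE"],
    ["FR", "US"],
    ["DE"],
]

counts = Counter()                   # accumulator lives outside the loop
for country_column in chunks:
    counts.update(country_column)    # adds one per occurrence in this chunk

print(counts["DE"])                  # 3
print(counts.most_common(1))         # [('DE', 3)]
```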

For raw file work, one habit keeps you out of trouble:

			
with open("orders.csv") as file:
    file.readline()   # skip the header
    for line in stream_lines(file):   # the generator defined earlier
        row = line.rstrip("\n").split(",")   # drop the newline before splitting

The with statement guarantees the file gets closed when the block exits, even if an exception fires halfway through. Leaked file handles cause subtle, miserable bugs, so always prefer with open(...) over a bare open() and a manual close() you will eventually forget. And a small honesty note about that last line: splitting on commas works for simple files but breaks the moment a field contains a quoted comma. For real CSV data, use the csv module or pandas, which handle the edge cases for you.
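The quoted-comma problem is easy to reproduce. This sketch compares a naive split with csv.reader on made-up data held in an io.StringIO rather than a real file.

```python
import csv
import io

raw = 'order_id,customer,country\n1,"Lovelace, Ada",UK\n2,Grace Hopper,US\n'

# Naive split: the quoted comma cuts one field into two.
naive = raw.splitlines()[1].split(",")
print(naive)     # ['1', '"Lovelace', ' Ada"', 'UK']

# csv.reader understands the quoting and yields one row at a time.
reader = csv.reader(io.StringIO(raw))
header = next(reader)
rows = list(reader)
print(header)    # ['order_id', 'customer', 'country']
print(rows[0])   # ['1', 'Lovelace, Ada', 'UK']
```

csv.reader is itself lazy, so it fits the streaming style of the rest of this article.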

The Quick Reference

Task	Code
Make an iterator	`it = iter(my_list)`
Get the next value	`next(it)`
Add an index to a loop	`enumerate(iterable, start=1)`
Pair up lists	`zip(list_a, list_b)`
Zip into a dict	`dict(zip(keys, vals))`
Unzip	`a, b = zip(*zipped)`
List comprehension	`[expr for x in it if cond]`
Dict comprehension	`{k: v for x in it}`
Generator expression	`(expr for x in it if cond)`
Generator function	`def f(): ... yield value`
Read a CSV in chunks	`pd.read_csv(file, chunksize=1000)`

The Mental Model

When you need all the values right now and the data is small, use a list comprehension with square brackets. When values should arrive one at a time because the data is large, use a generator, either round brackets or a yield function. When a loop needs positions, reach for enumerate; when it needs to walk two sequences in lockstep, reach for zip; and when a file outgrows memory, loop over pd.read_csv with a chunksize.

If you keep only one rule from this article, keep this one: the moment data might not fit in memory, switch to a generator. Pulling one value at a time is the difference between a script that streams a 100 GB file on a laptop and one that dies trying to load it.

See you soon.
