Intermediate PyTorch: Datasets, CNNs, RNNs, and Multi-Branch Models

Go beyond PyTorch basics. Build custom Datasets, OOP models, CNNs for images, RNNs, LSTMs and GRUs for sequences, plus multi-input and multi-output architectures for real-world problems.

The fundamentals of PyTorch get you a working model, but they lean on conveniences that quietly limit you. nn.Sequential only handles a straight stack of layers, and loading data by hand stops scaling the moment your inputs are images or sequences. This article is about the patterns real projects use: writing your own dataset class, building models as proper objects, and reaching for the architectures that suit images, time series, and inputs that come in more than one shape. If you have the basic training loop in your fingers, everything here slots on top of it.

Loading data the object-oriented way

The professional way to feed data into PyTorch is a custom dataset class. You subclass Dataset and implement exactly three methods. The constructor loads the data and sets up any transforms, __len__ reports how many samples exist, and __getitem__ returns a single sample by its index.

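import pandas as pd
from torch.utils.data import Dataset

class SensorDataset(Dataset):
    def __init__(self, csv_path):
        super().__init__()
        # Read the file once; later accesses are in-memory slices
        df = pd.read_csv(csv_path)
        # Assumes every column is numeric; float32 matches nn.Linear's default dtype
        self.data = df.to_numpy().astype("float32")

    def __len__(self):
        return self.data.shape[0]

    def __getitem__(self, idx):
        features = self.data[idx, :-1]  # every column except the last
        label = self.data[idx, -1]      # the last column holds the label
        return features, label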

Think of this as a contract with the DataLoader. You promise to tell it how many samples there are and how to fetch any one of them, and in return it handles batching, shuffling, and parallel loading for free. Reading the file once in the constructor and storing it as a NumPy array matters, because then every later access is a fast in-memory slice rather than a fresh disk read. Wrapping it is the same as before: instantiate the dataset, pass it to a DataLoader, and you can grab a single batch with next(iter(loader)) to sanity-check the shapes before committing to a full run.
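
As a rough sketch of that sanity check, the snippet below wraps a dataset in a DataLoader and pulls one batch. To stay runnable without a CSV file it uses ArrayDataset, a hypothetical stand-in for SensorDataset backed by a random in-memory array; the 100 rows, 10 columns, and batch size of 16 are all made up for illustration.

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader


class ArrayDataset(Dataset):
    """Hypothetical stand-in for SensorDataset, backed by an in-memory array."""

    def __init__(self, array):
        super().__init__()
        self.data = array.astype(np.float32)

    def __len__(self):
        return self.data.shape[0]

    def __getitem__(self, idx):
        # Last column is the label, everything before it is a feature
        return self.data[idx, :-1], self.data[idx, -1]


# 100 made-up samples: 9 feature columns plus 1 label column
rng = np.random.default_rng(0)
dataset = ArrayDataset(rng.random((100, 10)))

loader = DataLoader(dataset, batch_size=16, shuffle=True)

# Grab a single batch; the DataLoader collates NumPy values into tensors
features, labels = next(iter(loader))
print(features.shape, labels.shape)
```

If the printed shapes are not what the first layer of your model expects, it is much cheaper to find out here than halfway through an epoch.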

Models as classes, not just stacks

For anything beyond a linear pipeline you build the model by subclassing nn.Module. The pattern splits the model into two clear concerns. The constructor declares which layers exist, and the forward method declares the order they run in and how data flows between them.

import torch
import torch.nn as nn
import torch.nn.functional as F
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(9, 16)
        self.fc2 = nn.Linear(16, 8)
        self.fc3 = nn.Linear(8, 1)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = torch.sigmoid(self.fc3(x))
        return x

The quiet magic is that the instant you assign a layer to self.something, PyTorch registers its weights as trainable parameters. That is what makes model.parameters() find every weight without you listing them, and it is why the optimizer just works. The reason to prefer this over nn.Sequential is freedom: a class can take several inputs, branch and rejoin, add skip connections, or run conditional logic in forward, none of which a simple stack can express. One small but real gotcha lives here too. If you define a layer as self.rnn in the constructor but reference self.gru in forward, the first forward pass fails with an AttributeError, so keep those attribute names in sync.
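
To see the registration happen, here is a small sketch that rebuilds the same three layers and asks the module what it is tracking. The forward method is left out on purpose, since only the constructor matters for registration.

```python
import torch.nn as nn


class Net(nn.Module):
    def __init__(self):
        super().__init__()
        # Assigning to self.<name> is what registers each layer's parameters
        self.fc1 = nn.Linear(9, 16)
        self.fc2 = nn.Linear(16, 8)
        self.fc3 = nn.Linear(8, 1)


model = Net()

# Every weight and bias shows up under its attribute name
names = [name for name, _ in model.named_parameters()]
n_params = sum(p.numel() for p in model.parameters())
print(names)
print(n_params)  # (9*16 + 16) + (16*8 + 8) + (8*1 + 1) = 305
```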

Helping training start well

Two settings make a surprising difference to how smoothly a network begins learning: how its weights are initialised and which activation sits between layers. Initialisation runs once before training and its job is to place the weights where gradients can actually flow. Kaiming, also called He initialisation, is built for ReLU-family activations because it compensates for the fact that ReLU zeroes out half its inputs, while Xavier suits the saturating activations like sigmoid and tanh. In PyTorch the functions carry a trailing underscore, as in init.kaiming_uniform_, which is the library’s convention for an operation that modifies the tensor in place rather than returning a new one.
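
A minimal sketch of both initialisers, assuming a two-layer network where the first layer feeds a ReLU and the second feeds a sigmoid. The layer sizes are made up for illustration.

```python
import torch.nn as nn
import torch.nn.init as init

layer_relu = nn.Linear(9, 16)     # followed by a ReLU-family activation
layer_sigmoid = nn.Linear(16, 1)  # followed by a sigmoid

# The trailing underscore means the weight tensor is overwritten in place
init.kaiming_uniform_(layer_relu.weight, nonlinearity="relu")
init.xavier_uniform_(layer_sigmoid.weight)

print(layer_relu.weight.abs().max().item())
print(layer_sigmoid.weight.abs().max().item())
```

Both calls draw from a uniform range whose width depends on the layer's fan-in (and, for Xavier, fan-out), which is how the scale of the weights adapts to the layer size.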

On the activation side, ReLU remains the default for hidden layers, but ELU is a common upgrade because it is smooth for negative inputs rather than hard-zeroing them, and Leaky ReLU exists for the same reason, to stop neurons getting permanently stuck at zero. Sigmoid and softmax stay where they belong, on the output layer of binary and multi-class classifiers respectively.

Batch normalization

Batch normalization is one of the highest-value additions to a deep network. As data passes through many layers, the distribution of values at each layer keeps shifting as the weights update, a problem known as internal covariate shift, and it slows everything down. A batch norm layer re-centres and rescales each layer’s outputs to roughly zero mean and unit variance across the batch before the activation sees them. The payoff is faster training, much less sensitivity to how you initialised the weights, and a touch of regularisation thrown in.

The convention is to apply it between the linear layer and the activation, in the order linear, then batch norm, then activation. You declare an nn.BatchNorm1d whose size matches the output of the layer feeding it, and that size match is mandatory, since a mismatch throws a shape error. Use BatchNorm1d for tabular and sequential data and BatchNorm2d for images.
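
The ordering described above might look like this in a small tabular model. The layer sizes and the choice of ELU are illustrative assumptions, not a prescription.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(9, 16)
        self.bn1 = nn.BatchNorm1d(16)  # must equal fc1's output size
        self.fc2 = nn.Linear(16, 1)

    def forward(self, x):
        # linear -> batch norm -> activation
        x = F.elu(self.bn1(self.fc1(x)))
        return self.fc2(x)


model = Net()
x = torch.randn(32, 9)  # a made-up batch of 32 samples
out = model(x)
print(out.shape)
```

Remember that batch norm behaves differently in training and evaluation, so call model.eval() before validating and model.train() before resuming training.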

Convolutional networks for images

Images need a different shape of model. A convolutional network has two parts: a feature extractor built from convolution and pooling layers, and a classifier built from linear layers. The feature extractor slides small learnable filters across the image, with early layers picking up edges and textures and later layers assembling them into shapes. Pooling shrinks the spatial grid after each stage, keeping the strongest signal in each region and making the network tolerant of small shifts in position. Once the grid is small enough, a flatten step turns it into a flat vector the classifier can read.

class Net(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        self.feature_extractor = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1),
            nn.ELU(),
            nn.MaxPool2d(kernel_size=2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ELU(),
            nn.MaxPool2d(kernel_size=2),
            nn.Flatten(),
        )
        self.classifier = nn.Linear(64 * 16 * 16, num_classes)

    def forward(self, x):
        x = self.feature_extractor(x)
        return self.classifier(x)

The number that trips everyone up is the input size of that first linear layer, and you have to compute it by hand. Each MaxPool2d(2) halves the height and width, so a 64 by 64 image becomes 32 by 32 after the first pool and 16 by 16 after the second. With 64 channels at that point, the flattened vector is 64 times 16 times 16, which is 16,384. Get this wrong and the model will not run, so recompute it whenever you change the architecture.
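
One way to double-check the hand calculation is to push a dummy tensor through the feature extractor and read off the flattened size. This sketch rebuilds the same extractor and assumes 64 by 64 RGB inputs, as in the arithmetic above.

```python
import torch
import torch.nn as nn

feature_extractor = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1),
    nn.ELU(),
    nn.MaxPool2d(kernel_size=2),
    nn.Conv2d(32, 64, kernel_size=3, padding=1),
    nn.ELU(),
    nn.MaxPool2d(kernel_size=2),
    nn.Flatten(),
)

# A single fake image: batch of 1, 3 channels, 64 x 64 pixels
with torch.no_grad():
    dummy = torch.zeros(1, 3, 64, 64)
    n_features = feature_extractor(dummy).shape[1]

print(n_features)  # 64 channels * 16 * 16 = 16384
```

If the printed number disagrees with what you passed to the first nn.Linear, the arithmetic or the assumed image size is off.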

Getting the images in is easier than you might expect. If they are already sorted into one subfolder per class, ImageFolder reads the directory structure as the labels, with each folder name becoming a class. You pair it with a transform pipeline built by transforms.Compose, which at minimum converts each image to a tensor and resizes it to a consistent shape.

from torchvision.datasets import ImageFolder
from torchvision import transforms

train_transforms = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(45),
    transforms.ToTensor(),
    transforms.Resize((64, 64)),
])
dataset_train = ImageFolder('leaves_train', transform=train_transforms)

The random flips and rotations above are data augmentation, and they expand your effective training set by showing the model the same image under different transformations each epoch. A leaf rotated thirty degrees is still the same leaf, but without augmentation the model might never see that variation and could stumble on rotated test images. The crucial rule is to augment the training set only. Validation and test data should pass through a plain transform with no randomness, or your evaluation numbers will jump around meaninglessly. One more practical note when you go to view an image: PyTorch stores tensors as channels, height, width because convolutions prefer it, while matplotlib expects height, width, channels, so you reorder the axes with .permute(1, 2, 0) before plotting.
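
The axis reorder is easy to get backwards, so here is a tiny sketch on a made-up tensor. The image is deliberately non-square so the height and width axes are easy to tell apart.

```python
import torch

# A fake image in PyTorch layout: channels, height, width
image = torch.rand(3, 64, 48)

# Reorder to height, width, channels for plotting
image_hwc = image.permute(1, 2, 0)
print(image_hwc.shape)

# With matplotlib installed, plt.imshow(image_hwc) would now accept it
```

permute only changes how the axes are indexed, so the pixel values themselves are untouched.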

Measuring more than accuracy

For multi-class problems, accuracy alone hides too much. Precision asks, of everything the model labelled as a given class, what fraction really was that class, while recall asks, of everything that truly was that class, what fraction the model caught. A model can be strong on one and weak on the other, so reporting both is far more honest. The mechanical step is taking the class with the highest score for each prediction. torch.max(outputs, 1) does this by returning the maximum values together with their indices, and the indices are the predicted classes, the multi-class equivalent of thresholding a sigmoid at 0.5.

from torchmetrics import Precision
metric = Precision(task="multiclass", num_classes=7, average=None)

The average argument is worth understanding. Micro averaging pools every sample together for an overall, accuracy-like figure, macro takes an unweighted mean that treats every class equally regardless of size, and weighted takes a mean of the per-class scores in proportion to how many true samples each class has, which accounts for class imbalance. Setting it to None is the most diagnostic of all, because it returns one score per class. Pair those scores back to their names through the dataset’s class_to_idx mapping and you can see exactly which class is dragging the model down, which tells you precisely where to gather more data or look for confusion.
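
To make the per-class view concrete, the sketch below computes precision and recall by hand in plain PyTorch. It is an illustration of what average=None reports, not the torchmetrics implementation, and the class names, logits, and labels are all made up.

```python
import torch

# Hypothetical mapping, in the same shape as ImageFolder's class_to_idx
class_to_idx = {"maple": 0, "oak": 1, "pine": 2}

# Made-up model outputs for 6 samples, plus their true labels
outputs = torch.tensor([
    [2.0, 0.1, 0.3],
    [0.2, 1.5, 0.1],
    [0.1, 0.2, 3.0],
    [1.2, 0.9, 0.1],
    [0.3, 0.2, 0.1],
    [0.1, 0.4, 2.2],
])
labels = torch.tensor([0, 1, 2, 1, 0, 1])

# torch.max returns (values, indices); the indices are the predicted classes
_, preds = torch.max(outputs, 1)

precision, recall = {}, {}
for name, idx in class_to_idx.items():
    true_pos = ((preds == idx) & (labels == idx)).sum().item()
    predicted = (preds == idx).sum().item()  # everything labelled as this class
    actual = (labels == idx).sum().item()    # everything that truly is this class
    precision[name] = true_pos / predicted if predicted else 0.0
    recall[name] = true_pos / actual if actual else 0.0

print(precision)
print(recall)
```

In this toy data oak has perfect precision but poor recall: the model is right whenever it says oak, yet it misses two of the three real oaks. That is exactly the kind of imbalance a single accuracy figure would hide.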

Sequences and recurrent networks

Time series and other sequences need to be reframed as a supervised problem before a model can learn from them. The standard move is a sliding window: for each position, take a fixed number of preceding values as the input and the next value as the target, stepping the window across the whole series to generate many overlapping examples.

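import numpy as np

def create_sequences(df, seq_length):
    xs, ys = [], []
    for i in range(len(df) - seq_length):
        # Input: seq_length consecutive values from column 1
        xs.append(df.iloc[i:(i + seq_length), 1].to_numpy())
        # Target: the value that immediately follows the window
        ys.append(df.iloc[i + seq_length, 1])
    return np.array(xs), np.array(ys)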
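
As a quick usage sketch, here is the same function applied to a made-up ten-step series. The DataFrame layout, a timestamp in column 0 and the value in column 1, is an assumption chosen to match the column index the function reads.

```python
import numpy as np
import pandas as pd


def create_sequences(df, seq_length):
    xs, ys = [], []
    for i in range(len(df) - seq_length):
        xs.append(df.iloc[i:(i + seq_length), 1].to_numpy())
        ys.append(df.iloc[i + seq_length, 1])
    return np.array(xs), np.array(ys)


# Made-up series: the values are simply 0, 1, 2, ..., 9
df = pd.DataFrame({
    "timestamp": pd.date_range("2024-01-01", periods=10, freq="D"),
    "value": np.arange(10, dtype=np.float32),
})

X, y = create_sequences(df, seq_length=3)
print(X.shape, y.shape)  # (7, 3) (7,)
print(X[0], y[0])        # first window is [0. 1. 2.], its target is 3.0
```

Ten values with a window of three give seven examples, because the last three values have no successor to serve as a target.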

The model that consumes these windows is recurrent. A plain RNN walks through the sequence one step at a time, carrying a hidden state forward like a running summary of everything seen so far, and after the final step you take that last summary and map it to a prediction. The problem is that a single hidden vector cannot hold much, so early information fades and long sequences suffer from vanishing gradients. The LSTM fixes this by adding a second memory, the cell state, which runs the length of the sequence like a conveyor belt while gates decide what to add, remove, and read at each step. That is why an LSTM’s signature needs a tuple of two initial states, the hidden state and the cell state, where a plain RNN needs only one.

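class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=32,
                            num_layers=2, batch_first=True)
        self.fc = nn.Linear(32, 1)

    def forward(self, x):
        # Initial states are shaped (num_layers, batch, hidden_size),
        # created on the same device as the input
        h0 = torch.zeros(2, x.size(0), 32, device=x.device)
        c0 = torch.zeros(2, x.size(0), 32, device=x.device)
        out, _ = self.lstm(x, (h0, c0))
        # Keep only the output at the final time step
        return self.fc(out[:, -1, :])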

The GRU is the middle path. It merges the cell state back into the hidden state and condenses the gates from three down to two, giving similar long-range memory with fewer parameters and faster training, and in code it takes only a single initial state like the plain RNN. As a rule of thumb, start with an LSTM for forecasting or language tasks, switch to a GRU if training is too slow, and skip the plain RNN except for short sequences.
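
For comparison, a GRU version of the same model might look like the sketch below. GRUNet and count_params are hypothetical names; the layer sizes mirror the LSTM example so the parameter counts can be compared directly.

```python
import torch
import torch.nn as nn


class GRUNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.gru = nn.GRU(input_size=1, hidden_size=32,
                          num_layers=2, batch_first=True)
        self.fc = nn.Linear(32, 1)

    def forward(self, x):
        # Only one initial state, shaped (num_layers, batch, hidden_size)
        h0 = torch.zeros(2, x.size(0), 32, device=x.device)
        out, _ = self.gru(x, h0)
        return self.fc(out[:, -1, :])


def count_params(module):
    return sum(p.numel() for p in module.parameters())


model = GRUNet()
preds = model(torch.randn(4, 96, 1))  # 4 made-up sequences of length 96
print(preds.shape)

# Same sizes, different cell type: the GRU has three weight blocks per layer
# where the LSTM has four
lstm = nn.LSTM(input_size=1, hidden_size=32, num_layers=2, batch_first=True)
gru_params = count_params(model.gru)
lstm_params = count_params(lstm)
print(gru_params, lstm_params)
```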

Two details cause most recurrent-network bugs. The first is batch_first=True, which you almost always want, because the default ordering puts the sequence length first and confuses everyone. The second is reshaping. The DataLoader hands you sequences shaped as batch by length, but the recurrent layers expect a third dimension for the number of features per step, which is one for a univariate series. The fix is a reshape, but never hard-code the batch size as seqs.view(32, 96, 1), because the final batch is usually smaller and that will crash. Use seqs.view(seqs.size(0), 96, 1) or set drop_last=True on the DataLoader. Other than the reshape, the recurrent training loop is identical to any regression loop.
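
The smaller final batch is easy to reproduce. This sketch uses 100 made-up windows with a batch size of 32, so the last batch holds only 4 sequences, and reads the batch size from the tensor itself.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# 100 made-up windows of length 96, each with a scalar target
seqs_all = torch.randn(100, 96)
targets_all = torch.randn(100)
loader = DataLoader(TensorDataset(seqs_all, targets_all), batch_size=32)

batch_shapes = []
for seqs, targets in loader:
    # Read the batch size from the tensor instead of hard-coding 32
    seqs = seqs.view(seqs.size(0), 96, 1)
    batch_shapes.append(tuple(seqs.shape))

print(batch_shapes)
```

Swapping in seqs.view(32, 96, 1) would fail on the fourth batch, because 4 times 96 values cannot be reshaped into 32 by 96 by 1.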

When inputs come in different shapes

Sometimes a single example is more than one thing, say an image together with some categorical metadata. You cannot simply add a 64 by 64 image to a 30-element one-hot vector, because they have incompatible formats. The solution is to give each input its own sub-network that turns it into a fixed-length vector, then concatenate those vectors and let a single head make the final call. Your custom dataset just returns a longer tuple from __getitem__, and the DataLoader batches each position separately.

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.image_layer = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),
            nn.MaxPool2d(2),
            nn.ELU(),
            nn.Flatten(),
            nn.Linear(16 * 32 * 32, 128),
        )
        self.script_layer = nn.Sequential(
            nn.Linear(30, 8),
            nn.ELU(),
        )
        self.classifier = nn.Linear(128 + 8, 964)

    def forward(self, x_image, x_script):
        x_image = self.image_layer(x_image)
        x_script = self.script_layer(x_script)
        x = torch.cat((x_image, x_script), dim=1)
        return self.classifier(x)

The example above takes a handwritten glyph image plus a one-hot vector saying which of thirty writing systems it belongs to, and predicts which of 964 glyphs it is. Each branch produces its own embedding, torch.cat joins them side by side along the feature dimension, and the head reads both at once. The concatenation must be on dim=1, the feature dimension, not dim=0, which is the batch.
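
The dataset side of this can be sketched as follows. GlyphDataset is a hypothetical class filled with random tensors, there only to show the three-element tuple and how the DataLoader batches each position; the zero tensors at the end stand in for the two branch outputs.

```python
import torch
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader


class GlyphDataset(Dataset):
    """Hypothetical dataset returning (image, script one-hot, glyph label)."""

    def __init__(self, n_samples=20):
        super().__init__()
        self.images = torch.rand(n_samples, 1, 64, 64)
        script_ids = torch.randint(0, 30, (n_samples,))
        self.scripts = F.one_hot(script_ids, num_classes=30).float()
        self.labels = torch.randint(0, 964, (n_samples,))

    def __len__(self):
        return self.images.size(0)

    def __getitem__(self, idx):
        # A longer tuple: two inputs plus the target
        return self.images[idx], self.scripts[idx], self.labels[idx]


loader = DataLoader(GlyphDataset(), batch_size=8)
x_image, x_script, y = next(iter(loader))
print(x_image.shape, x_script.shape, y.shape)

# Stand-ins for the two branch outputs, joined along the feature dimension
emb_image = torch.zeros(8, 128)
emb_script = torch.zeros(8, 8)
joined = torch.cat((emb_image, emb_script), dim=1)
print(joined.shape)  # batch stays 8, features become 128 + 8 = 136
```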

When you need more than one prediction

The mirror image of multi-input is multi-output, where one input yields several predictions. Here you share a single backbone and split it into multiple heads. The backbone runs once and produces a shared feature vector, the model’s understanding of the input, and each head reads that same vector to answer a different question.

    def forward(self, x):
        features = self.image_layer(x)
        out_script = self.classifier_script(features)
        out_glyph = self.classifier_glyph(features)
        return out_script, out_glyph

This is efficient because the expensive feature extraction is not duplicated, and the two tasks often reinforce each other, since the backbone must learn representations useful for both. Training has one subtlety. Backpropagation needs a single number to minimise, and you have two losses, so you add them together. Because both losses live in the same computational graph, the combined gradient flows back through both heads and the shared backbone in one pass. If one task matters more, weight the sum, for example taking 0.7 of one loss and 0.3 of the other, which simply tells the optimizer to push harder on the head you care about. Evaluation runs both heads in a single pass, with a separate metric object accumulating results for each.
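
A single training step with a weighted sum might look like the sketch below. TwoHeadNet, the random batch, the SGD optimizer, and the 0.3 and 0.7 weights are all illustrative assumptions; the backbone reuses the image branch from the multi-input example.

```python
import torch
import torch.nn as nn


class TwoHeadNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.image_layer = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),
            nn.MaxPool2d(2),
            nn.ELU(),
            nn.Flatten(),
            nn.Linear(16 * 32 * 32, 128),
        )
        self.classifier_script = nn.Linear(128, 30)
        self.classifier_glyph = nn.Linear(128, 964)

    def forward(self, x):
        features = self.image_layer(x)
        return self.classifier_script(features), self.classifier_glyph(features)


model = TwoHeadNet()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# One made-up batch: 8 grayscale 64 x 64 images with two sets of labels
images = torch.rand(8, 1, 64, 64)
y_script = torch.randint(0, 30, (8,))
y_glyph = torch.randint(0, 964, (8,))

optimizer.zero_grad()
out_script, out_glyph = model(images)
loss_script = criterion(out_script, y_script)
loss_glyph = criterion(out_glyph, y_glyph)

# Weighted sum: here the glyph task is treated as the more important one
loss = 0.3 * loss_script + 0.7 * loss_glyph
loss.backward()  # gradients reach both heads and the shared backbone
optimizer.step()
print(loss.item())
```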

Choosing the right architecture

The shape of your data points straight at the architecture. Rows of tabular features call for a fully connected network of linear layers. Images call for a convolutional network of convolution, pooling, and a linear head. Sequences and time series call for a recurrent network, an LSTM by default, a GRU when you need speed, and a plain RNN only for short sequences. When an example carries several different inputs, give each its own branch and concatenate. When it needs several predictions, share a backbone and split into heads.

Alongside the architecture, keep a small stabilisation toolkit in mind. Kaiming or Xavier initialisation gets training off to a clean start, batch normalisation smooths it out, ELU or Leaky ReLU sidestep the dying ReLU problem, dropout and weight decay fight overfitting, gradient clipping tames the exploding gradients that plague recurrent networks, and a learning rate scheduler helps when progress plateaus. A couple of final correctness notes worth internalising: prefer BCEWithLogitsLoss over a sigmoid followed by BCELoss for numerical stability, and when you load grayscale images with .convert('L'), your first convolution must expect a single input channel.
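
The loss recommendation can be checked on a few made-up logits. For moderate values the two routes agree; for an extreme logit the sigmoid saturates to exactly 1.0 in float32, and the BCELoss route can no longer tell how wrong the prediction is.

```python
import torch
import torch.nn as nn

logits = torch.tensor([[-2.0], [0.5], [3.0]])
targets = torch.tensor([[0.0], [1.0], [1.0]])

# Route 1: raw logits straight into BCEWithLogitsLoss
loss_logits = nn.BCEWithLogitsLoss()(logits, targets)
# Route 2: sigmoid first, then BCELoss
loss_probs = nn.BCELoss()(torch.sigmoid(logits), targets)
print(loss_logits.item(), loss_probs.item())

# An extreme, confidently wrong prediction
extreme_logit = torch.tensor([[200.0]])
extreme_target = torch.tensor([[0.0]])
extreme_logits = nn.BCEWithLogitsLoss()(extreme_logit, extreme_target)
extreme_probs = nn.BCELoss()(torch.sigmoid(extreme_logit), extreme_target)
print(extreme_logits.item(), extreme_probs.item())
```

If you switch to BCEWithLogitsLoss, the model's forward method should return the raw output of the last linear layer and leave the sigmoid for inference time.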

Master these patterns and PyTorch stops being a collection of special cases. It becomes a small set of composable ideas, a dataset that knows how to serve itself, a model that declares its parts and their order, and a handful of layer types matched to the shape of the problem.

See you soon.
