PyTorch Deep Learning Fundamentals

PyTorch looks like a wall of objects, but it is one repeatable loop. Learn the fundamentals: tensors, layers, activations, loss, optimizers, the training loop, dropout, and hyperparameter tuning.

Deep learning frameworks can feel like a wall of unfamiliar objects, but PyTorch is built on a workflow that is the same every single time. You wrap your data in tensors, stack a few layers into a model, pick a loss function and an optimizer, then run a short loop that does the same five things over and over until the model gets good. Once that loop is in your fingers, every PyTorch project you ever read is just a variation on it. This article walks the whole path from a raw tensor to a trained, saved model, and explains why each piece is there.

Tensors, the thing everything is made of

A tensor is PyTorch’s core data structure, and it is best understood as a NumPy array with two superpowers: it can live on a GPU, and it can remember the operations performed on it so that gradients can be computed automatically. That second property is what makes training possible at all.

You create one by handing PyTorch some data, and it infers the shape and the numeric type for you.

import torch
readings = torch.tensor([[72, 75, 78], [70, 73, 76]])
print(readings.shape) # torch.Size([2, 3])
print(readings.dtype) # torch.int64

Tensors do arithmetic element by element, so adding a same-shaped tensor adds each matching cell at once, no loops required. Beyond building from a list you will constantly reach for torch.zeros, torch.ones, torch.randn for random normal values, and torch.from_numpy to share memory with an existing array. The attributes worth knowing from day one are .shape, .dtype, .device, which is either cpu or cuda, and .requires_grad, which is the flag that switches on gradient tracking.
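
To make those constructors and attributes concrete, here is a minimal sketch. The variable names and values are made up for illustration, and it assumes NumPy is installed alongside PyTorch.

```python
import numpy as np
import torch

# Element-wise arithmetic: matching cells are added, no loop needed.
a = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
b = torch.ones(2, 2)
total = a + b

# Common constructors.
zeros = torch.zeros(2, 3)   # all zeros, float32 by default
noise = torch.randn(2, 3)   # samples from a standard normal

# from_numpy shares memory: changing the array changes the tensor.
arr = np.array([1.0, 2.0, 3.0])
shared = torch.from_numpy(arr)
arr[0] = 10.0

print(total)
print(zeros.shape, zeros.dtype, zeros.device, zeros.requires_grad)
print(shared)
```

Because from_numpy shares memory rather than copying, the edit to arr shows up in shared as well, which is handy for speed but worth remembering when you mutate arrays.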

Building a network by stacking layers

A neural network is just layers applied in order, and the simplest way to express that in PyTorch is nn.Sequential, a container that pipes data through each layer from top to bottom. The basic building block is the fully connected layer, nn.Linear(in_features, out_features), which holds a weight matrix and a bias vector and computes a weighted sum of its inputs.

import torch.nn as nn
model = nn.Sequential(
    nn.Linear(8, 4),
    nn.Linear(4, 1)
)
# nn.Linear expects floating-point input, so the values are written as floats
output = model(torch.tensor([[2., 3., 6., 7., 9., 3., 2., 1.]]))

Think of this as an assembly line. Eight numbers enter the first station and leave as four, those four enter the second station and leave as one, and calling the model on an input fires the line. The one rule you cannot break is that the output size of one layer must equal the input size of the next, so a layer producing four outputs must be followed by a layer expecting four inputs.

It helps to know how many parameters you are training, and you can count them in a line by summing the element count of every parameter tensor.

total = sum(p.numel() for p in model.parameters())

The arithmetic is worth internalising. A linear layer has one weight per input-output pair plus one bias per output, so its parameter count is the inputs times the outputs, plus the outputs. A small three-layer network going from nine features to four to two to one therefore has 40 plus 10 plus 3, which is 53 numbers the optimizer will tune.
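
The count can be checked both ways in a few lines. This is a small sketch; linear_params is a hypothetical helper written for this example, not a PyTorch function.

```python
import torch.nn as nn

def linear_params(n_in, n_out):
    """Weights (n_in * n_out) plus biases (n_out) of one linear layer."""
    return n_in * n_out + n_out

# The nine -> four -> two -> one network described above.
net = nn.Sequential(nn.Linear(9, 4), nn.Linear(4, 2), nn.Linear(2, 1))

by_hand = linear_params(9, 4) + linear_params(4, 2) + linear_params(2, 1)
counted = sum(p.numel() for p in net.parameters())
print(by_hand, counted)  # 53 53
```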

Activations, where the intelligence comes from

Here is a fact that surprises beginners: a stack of linear layers with nothing between them is mathematically identical to a single linear layer. You could collapse a hundred of them into one matrix multiplication and get the same answer. Activation functions are what break that collapse by inserting non-linearity, and that non-linearity is precisely what lets a network learn curves, boundaries, and complex relationships.
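
The collapse is easy to verify numerically. The sketch below, with layer sizes chosen arbitrarily for illustration, folds two linear layers into one by multiplying their weight matrices and checks that the outputs agree.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
first = nn.Linear(8, 4)
second = nn.Linear(4, 1)

# W2 (W1 x + b1) + b2  ==  (W2 W1) x + (W2 b1 + b2)
merged = nn.Linear(8, 1)
with torch.no_grad():
    merged.weight.copy_(second.weight @ first.weight)
    merged.bias.copy_(second.weight @ first.bias + second.bias)

x = torch.randn(5, 8)
stacked_out = second(first(x))
merged_out = merged(x)
same = torch.allclose(stacked_out, merged_out, atol=1e-5)
print(same)
```

Put an nn.ReLU() between the two layers and the algebra no longer goes through, which is the whole point of an activation.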

Each activation has a job. Sigmoid squashes any number into the range zero to one, which makes it the natural choice for the final layer of a binary classifier where you want a probability. Softmax does the same for several classes at once, forcing the scores to compete so they sum to exactly one, which suits the final layer of a multi-class classifier. ReLU, which simply returns the input if positive and zero otherwise, is the standard choice inside the hidden layers because it is cheap and, unlike sigmoid, does not flatten out for large positive values.

relu = nn.ReLU()
sigmoid = nn.Sigmoid()
softmax = nn.Softmax(dim=-1)

That flattening is the reason ReLU dominates hidden layers. Sigmoid and softmax saturate at their extremes, meaning their outputs barely change as the input grows, so the gradient there shrinks toward zero and learning stalls in deep networks. Leaky ReLU is a small refinement that lets a sliver of signal through for negative inputs, which prevents neurons from getting permanently stuck at zero, the so-called dying ReLU problem.
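
You can watch saturation happen by asking autograd for the slope of each activation. A small sketch, with input values picked purely for illustration:

```python
import torch
import torch.nn.functional as F

# Slope of sigmoid at a moderate input (1.0) and a large one (10.0).
x_sig = torch.tensor([1.0, 10.0], requires_grad=True)
torch.sigmoid(x_sig).sum().backward()
sigmoid_grad = x_sig.grad   # the second entry is tiny: sigmoid has saturated

# Slope of ReLU at the same inputs.
x_relu = torch.tensor([1.0, 10.0], requires_grad=True)
torch.relu(x_relu).sum().backward()
relu_grad = x_relu.grad     # slope stays at 1 for any positive input

# Leaky ReLU lets a sliver of a negative input through instead of zero.
leaky_out = F.leaky_relu(torch.tensor([-2.0]), negative_slope=0.01)

print(sigmoid_grad, relu_grad, leaky_out)
```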

The last layer depends on the question

The body of a network looks much the same across tasks. What changes is the final layer, and it is dictated entirely by what you are predicting. A yes or no question ends in a single neuron pushed through a sigmoid. A choice among several classes ends in one neuron per class pushed through a softmax. A continuous number, like a price or a temperature, ends in a single neuron with no activation at all, because any activation would clamp the output to a fixed range and a real-world quantity has no such ceiling.

# Multi-class classifier
nn.Sequential(
    nn.Linear(11, 20),
    nn.ReLU(),
    nn.Linear(20, 4),
    nn.Softmax(dim=-1)  # turns the 4 scores into probabilities; omit when training with nn.CrossEntropyLoss
)
# Regression
nn.Sequential(
    nn.Linear(11, 16),
    nn.ReLU(),
    nn.Linear(16, 1)
)

The pairing of output layer and loss function is fixed enough to memorise. Regression uses no final activation and mean squared error loss. Binary classification uses a sigmoid and binary cross-entropy. Multi-class uses a softmax conceptually and cross-entropy loss, with an important caveat covered next.

Loss, the single number you are minimising

A loss function reduces the entire batch of predictions to one number that says how wrong the model was, and training is nothing more than driving that number down. For regression, mean squared error subtracts each prediction from its target, squares it so positive and negative errors do not cancel and so that large mistakes are punished disproportionately, and averages the result. For classification, cross-entropy compares the predicted probability distribution against the true label and grows large when the model is confident about the wrong class.
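
For the regression case, the recipe is short enough to write out by hand and compare against the built-in. The numbers here are invented for the example.

```python
import torch
import torch.nn as nn

predictions = torch.tensor([2.5, 0.0, 2.0])
targets = torch.tensor([3.0, -0.5, 2.0])

# Subtract, square, average: errors of -0.5, 0.5 and 0.0 give 0.5 / 3.
by_hand = ((predictions - targets) ** 2).mean()
built_in = nn.MSELoss()(predictions, targets)
print(by_hand.item(), built_in.item())
```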

criterion = nn.CrossEntropyLoss()
loss = criterion(logits, targets)

One of the most common PyTorch bugs lives right here. nn.CrossEntropyLoss applies log-softmax internally, so you must feed it the raw scores (the logits) from the final linear layer, not softmax outputs. If you add a softmax to your model and also use cross-entropy loss, you are applying softmax twice, which squashes the scores toward a flat distribution, so training quietly suffers without ever raising an error. The fix is to leave the raw logits alone and let the loss function do the conversion.
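
A quick way to convince yourself is to compute the loss three ways. In this sketch the logits and targets are invented; it shows that the loss on raw logits equals log-softmax followed by negative log-likelihood, and that sneaking in an extra softmax gives a different, worse number for the same predictions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]])
targets = torch.tensor([0, 2])   # both rows are predicted correctly
criterion = nn.CrossEntropyLoss()

correct = criterion(logits, targets)
manual = F.nll_loss(F.log_softmax(logits, dim=-1), targets)

# The bug: softmax in the model, then softmax again inside the loss.
double = criterion(F.softmax(logits, dim=-1), targets)

print(correct.item(), manual.item(), double.item())
```

The doubled version reports a noticeably higher loss even though the model is confidently right, because the second softmax sees inputs squeezed between zero and one.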

Gradients, backpropagation, and the optimizer

Once you have a loss, you need to know how to change each weight to reduce it, and that is what gradients tell you. Calling loss.backward() walks backward through every operation that produced the loss, applies the chain rule, and stores in each parameter’s .grad attribute a number saying how much that weight contributed to the error. The optimizer then reads those gradients and nudges every weight in the direction that lowers the loss.

You could do the update by hand, subtracting a small fraction of each gradient from each weight, where that fraction is the learning rate. In practice you let an optimizer do it across all parameters at once.
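
Here is what that hand-rolled update might look like for a single weight vector. It is a toy sketch with made-up numbers, meant only to show what optimizer.step() saves you from writing.

```python
import torch

w = torch.tensor([1.0, -2.0], requires_grad=True)
x = torch.tensor([3.0, 4.0])

# Squared error of one prediction against a target of 1.0.
loss = ((w * x).sum() - 1.0) ** 2   # (3 - 8 - 1)^2 = 36
loss.backward()
grad_before = w.grad.clone()        # 2 * (-6) * x = [-36, -48]

learning_rate = 0.01
with torch.no_grad():               # the update itself must not be tracked
    w -= learning_rate * w.grad     # step against the gradient
w.grad.zero_()                      # clear for the next round

print(grad_before, w)
```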

import torch.optim as optim
optimizer = optim.SGD(model.parameters(), lr=0.001)
loss.backward() # fill in the gradients
optimizer.step() # update every weight
optimizer.zero_grad() # reset before the next batch

That last line is not optional. PyTorch accumulates gradients by default, so if you forget to zero them, each batch’s gradients pile on top of the previous batch’s and your updates become nonsense. As for which optimizer to choose, plain stochastic gradient descent is the well-understood baseline, adding momentum speeds convergence and helps escape shallow local minima, and Adam, with its adaptive per-parameter learning rate, is the robust default most people start with. AdamW is Adam with weight decay handled correctly, and is often the better choice when you want regularisation.
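
The accumulation is easy to see on a single scalar. In this small sketch the same backward pass is run twice without zeroing, and the stored gradient doubles.

```python
import torch

w = torch.tensor(2.0, requires_grad=True)

(w * 3.0).backward()
first = w.grad.item()          # 3.0, the true gradient

(w * 3.0).backward()           # no zeroing in between
accumulated = w.grad.item()    # 6.0, the old gradient plus the new one

w.grad.zero_()
(w * 3.0).backward()
after_reset = w.grad.item()    # back to 3.0

print(first, accumulated, after_reset)
```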

Feeding data in batches

Real datasets are too big to push through the model all at once, so you work in batches, and PyTorch gives you two tools that handle the bookkeeping. TensorDataset zips features and labels together so that asking for sample i always returns the matching pair. DataLoader wraps that dataset and yields batches, optionally shuffling the order each epoch.

from torch.utils.data import TensorDataset, DataLoader
dataset = TensorDataset(torch.tensor(X), torch.tensor(y))
loader = DataLoader(dataset, batch_size=32, shuffle=True)

Batch sizes of 32, 64, or 128 are common. Shuffling should be on for training so the model does not learn the order of the data, and off for validation. Two other arguments earn their keep on larger jobs: num_workers parallelises the loading, and drop_last discards a final undersized batch when you need every batch to be the same shape.
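
The effect of drop_last is easiest to see by listing the batch sizes. This sketch uses 100 random samples, an arbitrary choice for illustration, and leaves num_workers at its default so it runs anywhere.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

X = torch.randn(100, 8)
y = torch.randn(100, 1)
dataset = TensorDataset(X, y)

# Training: shuffled, and the leftover 4 samples are dropped each epoch.
train_loader = DataLoader(dataset, batch_size=32, shuffle=True, drop_last=True)
# Validation: fixed order, every sample kept.
val_loader = DataLoader(dataset, batch_size=32, shuffle=False)

train_sizes = [features.shape[0] for features, _ in train_loader]
val_sizes = [features.shape[0] for features, _ in val_loader]
print(train_sizes)  # [32, 32, 32]
print(val_sizes)    # [32, 32, 32, 4]
```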

The training loop

Everything now comes together in a loop with two levels. The outer loop counts epochs, each a full pass over the data, and the inner loop walks through one batch at a time. Inside, the same five steps repeat in the same order.

for epoch in range(num_epochs):
    for features, target in loader:
        optimizer.zero_grad()
        prediction = model(features)
        loss = criterion(prediction, target)
        loss.backward()
        optimizer.step()

The ritual is worth committing to memory as a phrase: zero the gradients, run the forward pass, compute the loss, run the backward pass, take a step. Clear the slate, make a guess, measure the error, find which weights caused it, and nudge them. Working in batches rather than the whole dataset is a deliberate compromise: it gives more frequent updates than a full-batch pass and far less noise than updating on a single example at a time.
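
To see the five steps run end to end, here is a self-contained sketch on synthetic data. The dataset, the hidden true_w, the learning rate, and the epoch count are all assumptions chosen so the example converges quickly; a real project would swap in its own.

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import TensorDataset, DataLoader

torch.manual_seed(0)

# Synthetic regression problem: y is an exact linear function of X.
X = torch.randn(256, 3)
true_w = torch.tensor([[2.0], [-1.0], [0.5]])
y = X @ true_w + 0.1

model = nn.Linear(3, 1)
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)
loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

epoch_losses = []
for epoch in range(20):
    running = 0.0
    for features, target in loader:
        optimizer.zero_grad()                  # 1. clear old gradients
        prediction = model(features)           # 2. forward pass
        loss = criterion(prediction, target)   # 3. measure the error
        loss.backward()                        # 4. compute gradients
        optimizer.step()                       # 5. update the weights
        running += loss.item() * features.shape[0]
    epoch_losses.append(running / len(X))      # mean loss over the epoch

print(epoch_losses[0], epoch_losses[-1])
```

Tracking a per-epoch average, rather than printing every batch loss, gives a smoother curve to judge progress by.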

Evaluating without cheating

When you check how the model does on held-out data, you must not let that data change the weights, so the evaluation loop runs the forward pass only, with no backward and no step. Two switches make this correct and efficient.

model.eval()
with torch.no_grad():
    for features, labels in val_loader:
        outputs = model(features)
        loss = criterion(outputs, labels)
model.train()

Calling model.eval() puts the model in evaluation mode, which turns off dropout so every neuron participates and tells batch normalisation to use its stored statistics, giving stable, repeatable outputs. Wrapping the loop in torch.no_grad() tells PyTorch not to build the graph it would need for gradients, which saves memory and runs faster. Forgetting model.eval() is a subtle bug, because dropout will stay active and your validation numbers will be both noisy and pessimistic. Flip back to model.train() afterwards so the next epoch behaves correctly. For metrics like accuracy that should reflect the whole validation set rather than a single batch, a library such as torchmetrics accumulates the result across batches and reports it once at the end.
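
If you would rather not add a dependency, the same accumulation can be done by hand. The sketch below is an assumption-laden stand-in for a metrics library, not its API: it sums the loss and the correct predictions over all batches and divides once at the end. The model and the random validation data are placeholders.

```python
import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader

torch.manual_seed(0)
model = nn.Sequential(
    nn.Linear(4, 8), nn.ReLU(), nn.Dropout(0.5), nn.Linear(8, 3)
)
val_data = TensorDataset(torch.randn(50, 4), torch.randint(0, 3, (50,)))
val_loader = DataLoader(val_data, batch_size=16)

# reduction="sum" so that a small final batch is weighted correctly.
criterion = nn.CrossEntropyLoss(reduction="sum")

correct, seen, total_loss = 0, 0, 0.0
model.eval()
with torch.no_grad():
    for features, labels in val_loader:
        logits = model(features)
        total_loss += criterion(logits, labels).item()
        correct += (logits.argmax(dim=-1) == labels).sum().item()
        seen += labels.shape[0]
model.train()

accuracy = correct / seen
mean_loss = total_loss / seen
print(accuracy, mean_loss)
```

Averaging per-batch means instead would overweight the final, smaller batch, which is why the sums are kept until the end.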

Fighting overfitting

Overfitting is the moment the model stops learning patterns and starts memorising the training set, and you spot it when training loss keeps falling while validation loss begins to climb. The two regularisation tools you reach for first are dropout and weight decay.

Dropout randomly zeroes a fraction of neurons on each training pass. The effect is like studying with half your notes randomly hidden each session: because you never know which half will be missing, you are forced to learn every part independently rather than leaning on a few favourite neurons. At evaluation time dropout switches off and the full network is used.

model = nn.Sequential(
nn.Linear(8, 6),
nn.ReLU(),
nn.Dropout(p=0.5),
nn.Linear(6, 4)
)
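
The two modes are easy to compare on a tensor of ones. One detail worth knowing: in training mode PyTorch scales the surviving values by 1 / (1 - p), so the expected total stays the same and nothing needs rescaling at evaluation time. A minimal sketch:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(1000)

drop.train()
train_out = drop(x)   # roughly half zeros, survivors scaled to 2.0

drop.eval()
eval_out = drop(x)    # dropout is a no-op in evaluation mode

print(train_out[:8], eval_out[:8])
```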

Weight decay attacks overfitting from a different angle by adding a small penalty on large weights, which you enable simply by passing it to the optimizer.

optimizer = optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)

Large weights make a model hypersensitive to tiny input changes, a classic symptom of overfitting, so taxing them gently at every step pushes the model to spread its attention across many features. Beyond these two, the broader toolkit includes early stopping when validation loss turns upward, reducing the network’s capacity, data augmentation on small datasets, and the most reliable fix of all, gathering more data.

Initialisation and transfer learning

How weights start matters more than it first appears. If every weight begins identical, every neuron computes the same thing and they never differentiate. If weights start too large the signal explodes through deep layers, and too small it vanishes. Schemes like Xavier initialisation, suited to sigmoid and tanh, and Kaiming initialisation, suited to ReLU, choose the starting scale mathematically so the signal variance stays roughly steady from layer to layer.
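
Both schemes live in torch.nn.init. The sketch below shows one way to apply them; init_weights is a hypothetical helper name, and zeroing the biases is a common convention assumed here rather than something the schemes require.

```python
import torch.nn as nn

def init_weights(module):
    """Kaiming initialisation for every linear layer in a ReLU network."""
    if isinstance(module, nn.Linear):
        nn.init.kaiming_uniform_(module.weight, nonlinearity="relu")
        nn.init.zeros_(module.bias)

net = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
net.apply(init_weights)   # apply() visits every submodule

# Xavier initialisation for a layer that feeds a tanh or sigmoid.
tanh_layer = nn.Linear(8, 16)
nn.init.xavier_uniform_(tanh_layer.weight)

print(net[0].weight.std().item(), tanh_layer.weight.std().item())
```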

Often you do not start from scratch at all. Transfer learning takes a model already trained on a large dataset and adapts it to your smaller task, and the key technique is freezing the early layers so their hard-won general features are preserved while only the later layers learn your specifics.

for name, param in model.named_parameters():
    if name in ['0.weight', '0.bias']:
        param.requires_grad = False

Setting requires_grad to false tells PyTorch to skip gradients for those parameters and the optimizer to leave them alone. When fine-tuning pretrained weights, use a small learning rate, something like 1e-5, so you adjust rather than destroy what the model already knows. Saving and loading is handled through the state dictionary, which maps each parameter name to its values; torch.save(model.state_dict(), path) writes it out and model.load_state_dict(torch.load(path)) reads it back into a model whose architecture matches exactly, since PyTorch matches parameters by name.
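
Put together, freezing, fine-tuning, and a save and load round trip might look like the sketch below. The tiny model stands in for a real pretrained network, the file name is arbitrary, and handing the optimizer only the trainable parameters is a common convention assumed here rather than a requirement.

```python
import os
import tempfile

import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Sequential(nn.Linear(8, 4), nn.ReLU(), nn.Linear(4, 1))

# Freeze the first layer.
for name, param in model.named_parameters():
    if name in ["0.weight", "0.bias"]:
        param.requires_grad = False

# Optimise only what is still trainable, with a small learning rate.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = optim.Adam(trainable, lr=1e-5)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model_state.pt")
    torch.save(model.state_dict(), path)

    # The receiving model must have the same architecture.
    restored = nn.Sequential(nn.Linear(8, 4), nn.ReLU(), nn.Linear(4, 1))
    restored.load_state_dict(torch.load(path))

match = all(
    torch.equal(a, b)
    for a, b in zip(model.state_dict().values(), restored.state_dict().values())
)
print(len(trainable), match)
```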

Tuning the knobs

The settings you choose rather than learn are the hyperparameters, and a handful dominate. The learning rate is the most important, governing step size; too high and the loss oscillates or blows up to NaN, too low and training crawls. Momentum, batch size, dropout rate, and weight decay each have a sensible range and a cost at either extreme. When you search for good values, random search usually beats an exhaustive grid, and the reason is instructive.

import numpy as np
for _ in range(10):
    factor = np.random.uniform(2, 4)
    lr = 10 ** -factor
    momentum = np.random.uniform(0.85, 0.99)

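Learning rates span orders of magnitude, so sampling them on a linear scale would cluster nearly every trial near the top and never explore the small end. Sampling the exponent instead, then converting with ten to the negative power, gives every order of magnitude an equal shot, which is why a log scale tends to find good learning rates that a linear scale misses. Drawing the values at random rather than from a fixed grid helps for a related reason: every trial tests a fresh value of each hyperparameter, so the few settings that matter most are explored far more densely than a grid with the same number of trials would allow.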
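
The clustering is easy to measure. This sketch draws learning rates between 1e-4 and 1e-2 both ways and counts how many land in the lower decade, below 1e-3; the sample size and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Linear scale: uniform between the two endpoints.
linear_lrs = rng.uniform(1e-4, 1e-2, size=n)
# Log scale: uniform exponent, then convert.
log_lrs = 10 ** -rng.uniform(2, 4, size=n)

linear_small = np.mean(linear_lrs < 1e-3)   # roughly 9% of trials
log_small = np.mean(log_lrs < 1e-3)         # roughly 50% of trials
print(linear_small, log_small)
```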

A debugging habit worth keeping

Before you ever train on the full dataset, try to overfit a single tiny batch on purpose. If the model cannot drive the loss to near zero on eight examples it has seen repeatedly, the problem is not your data or your regularisation, it is a bug: a wrong output shape, the wrong loss for the task, or a missing zero_grad letting stale gradients pile up. Only once the model can memorise a handful of samples should you scale up and add regularisation. It is the fastest way to separate a broken pipeline from a hard learning problem.
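
A sketch of that check follows. The eight random examples, the small network, the choice of Adam, and the step count are all assumptions for illustration; the point is only that the loss on one fixed batch should collapse.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# One tiny, fixed batch: eight examples with random targets.
features = torch.randn(8, 5)
targets = torch.randn(8, 1)

model = nn.Sequential(nn.Linear(5, 32), nn.ReLU(), nn.Linear(32, 1))
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)

for step in range(1000):
    optimizer.zero_grad()
    loss = criterion(model(features), targets)
    loss.backward()
    optimizer.step()
    if step == 0:
        initial_loss = loss.item()   # loss before any learning
final_loss = loss.item()

print(initial_loss, final_loss)
```

If the final loss is not a small fraction of the initial one, look for a bug before touching the data or the hyperparameters.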

Conclusion

Stripped to its essence, the PyTorch workflow is a pipeline. Raw data becomes feature and label arrays, those go into a TensorDataset and then a DataLoader that serves batches, each batch flows forward through the model to produce predictions, the loss measures the error, backward computes the gradients, the optimizer updates the weights, and you repeat for many epochs before switching to evaluation mode for validation and inference. Every deep learning project you meet, however elaborate, is this loop with richer layers and bigger data. Learn the loop and the rest is detail.

See you soon.
