Keras is the friendliest on-ramp to deep learning, and the reason is that almost everything reduces to the same three steps. You build a model by stacking layers, you compile it by choosing an optimizer and a loss, and you fit it to your data for some number of epochs. Once that rhythm is in your hands, the difference between a regression model, an image classifier, and a text generator is mostly a matter of which layers you stack and which loss you pick. This article is a practical tour through all of those cases, from your first network to convolutional nets, transfer learning, and LSTMs.
The three-step rhythm
Every Keras project follows the same arc. The Sequential model is an empty container you add layers to, the compile call decides how the model learns, and the fit call decides how long it trains. After that you evaluate on held-out data and predict on new inputs. Hold onto this shape, because nothing below departs from it.
```python
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(10, input_shape=(2,), activation='relu'))
model.add(Dense(1))
```
This minimal network takes two inputs, passes them through ten hidden neurons with a ReLU activation, and produces a single number. Picture the hidden layer as ten workers who each receive the two inputs, process them, and pass a result forward, and the output neuron as the one who reads all ten results and writes down the final prediction. ReLU simply means each worker ignores negative signals, clamping them to zero, before passing the baton.
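To make the worker picture concrete, here is a minimal sketch of the same forward pass in plain Python, with no Keras involved. The weights are random and untrained, and the input values, the seed, and the helper names `relu` and `dense` are all illustrative assumptions rather than anything Keras exposes.

```python
import random

def relu(x):
    """Clamp negative values to zero and pass positive values through."""
    return max(0.0, x)

def dense(inputs, weights, biases, activation=None):
    """One dense layer: each neuron takes a weighted sum of all inputs plus a bias."""
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        total = sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        outputs.append(activation(total) if activation else total)
    return outputs

random.seed(0)  # arbitrary seed so the untrained weights are repeatable

# Hidden layer: 10 neurons, each with 2 weights and 1 bias.
hidden_w = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(10)]
hidden_b = [0.0] * 10
# Output layer: 1 neuron reading all 10 hidden results, no activation.
output_w = [[random.uniform(-1, 1) for _ in range(10)]]
output_b = [0.0]

hidden = dense([0.5, -1.2], hidden_w, hidden_b, activation=relu)
prediction = dense(hidden, output_w, output_b)
```

Training only changes the numbers inside `hidden_w`, `hidden_b`, `output_w`, and `output_b`. The flow of data stays exactly as shown.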
The single most important decision
If there is one thing to internalise about Keras, it is how the last layer and the loss function must match the task, because getting this wrong is the most common beginner mistake and it fails quietly rather than loudly.
For regression you want any real number out, so the output layer has no activation and you train with mean squared error. Adding a sigmoid would wrongly clamp the output to the zero-to-one range, and a softmax on a single neuron would pin it at exactly one. For binary classification you want a single probability, so you use one neuron with a sigmoid, paired with binary cross-entropy. For multi-class problems where each example belongs to exactly one class, you want one neuron per class with a softmax, which forces the scores to compete and sum to one, paired with categorical cross-entropy. For multi-label problems where several labels can be true at once, you want one neuron per label, each with its own independent sigmoid, again with binary cross-entropy. The deciding question is always whether more than one label can apply to the same input. If no, it is multi-class with softmax. If yes, it is multi-label with sigmoid.
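The pairing rules above fit in a small lookup table. The helper below is a hypothetical convenience, not part of Keras. It simply returns the number of output units, the activation, and the loss name for each task, using the string identifiers Keras accepts for activations and losses.

```python
def output_head(task, n_outputs=1):
    """Return (units, activation, loss) for the final layer of a given task.

    `task` is one of 'regression', 'binary', 'multiclass', 'multilabel'.
    An activation of None means the output layer is left linear.
    """
    heads = {
        "regression": (n_outputs, None, "mse"),
        "binary": (1, "sigmoid", "binary_crossentropy"),
        "multiclass": (n_outputs, "softmax", "categorical_crossentropy"),
        "multilabel": (n_outputs, "sigmoid", "binary_crossentropy"),
    }
    if task not in heads:
        raise ValueError(f"unknown task: {task!r}")
    return heads[task]

units, activation, loss = output_head("multiclass", n_outputs=4)
```

You would then pass `units` and `activation` to the last Dense layer and `loss` to compile.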
It is worth understanding the parameter count too, because it gives you a rough sense of your overfitting risk. A dense layer has one weight per input-output connection plus one bias per output, so a layer taking three inputs into five neurons has fifteen weights and five biases, twenty numbers in all. Call model.summary() and check the total against your sample size. Fifty thousand parameters trained on two hundred examples will almost certainly overfit unless you regularise heavily.
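The arithmetic is easy to check by hand. This short sketch reproduces the count for the three-into-five layer and for the two-layer network from the opening example. The function names are illustrative.

```python
def dense_params(n_inputs, n_neurons):
    """One weight per input-output connection plus one bias per neuron."""
    return n_inputs * n_neurons + n_neurons

def total_params(layer_sizes):
    """Sum the dense parameter counts over consecutive sizes, e.g. [2, 10, 1]."""
    return sum(dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:]))

example = dense_params(3, 5)              # 15 weights + 5 biases = 20
first_network = total_params([2, 10, 1])  # (2*10 + 10) + (10*1 + 1) = 41
```

For a model built only from dense layers, these totals should match what model.summary() reports.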
Regression and the limits of a model
A deeper regression model with several hidden layers can approximate a curved relationship, like the arc of a projectile, which is roughly quadratic. Each hidden layer refines the representation the previous one produced, so stacking a few gives the network the capacity to bend its output into complex shapes.
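```python
import numpy as np

# Briefing: Adam as the optimizer, mean squared error as the loss.
model.compile(optimizer='adam', loss='mse')
# Practice: thirty passes over the (time, height) pairs.
model.fit(time_steps, heights, epochs=30)
# Game day: forward pass only, one time value per row, inside the training range.
predicted_path = model.predict(np.arange(-10, 11).reshape(-1, 1))
```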
Compiling is the briefing, choosing Adam as the optimization strategy and mean squared error as the way to measure mistakes. Fitting is the practice, showing the model every time step and its height thirty times over while it adjusts weights to shrink the error. Predicting is game day, a pure forward pass with no learning. The trap here is extrapolation. If the model only ever saw times between minus ten and ten, asking it to predict at time fifty produces nonsense, because neural networks do not extrapolate cleanly beyond the range they were trained on. Always know that range.
Classification, from one class to many
Binary classification is the smallest classifier: a few features flowing into one sigmoid neuron that outputs a probability, with binary cross-entropy measuring how wrong those probabilities are. Before modelling anything, though, look at the data. A seaborn pairplot coloured by label, alongside describe() and value_counts(), will tell you in seconds whether the classes are visibly separable and whether they are badly imbalanced, problems a loss curve alone will never explain to you later.
Multi-class is the natural extension. Say you are predicting which of four players made a throw from its landing coordinates. You use four output neurons with a softmax, which behaves like a vote counter, turning raw scores into percentages that sum to one hundred so the classes genuinely compete. The one preparation step that matters is the labels. The model speaks numbers, not names, so you map each player name to an integer, but a raw integer wrongly implies an ordering, so you one-hot encode it into a vector with a single one in the right slot.
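```python
import pandas as pd
from keras.utils import to_categorical

# Map each player name to an integer code, then one-hot encode the codes.
players['player'] = pd.Categorical(players['player']).codes
y = to_categorical(players['player'])
```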
When you predict, the model returns a probability per class, and np.argmax picks the winning index. One subtlety to remember: use categorical cross-entropy when your labels are one-hot like this, but switch to sparse categorical cross-entropy if you keep your labels as plain integers, because mixing them up throws shape errors.
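To see what the encoding and the argmax step actually do, here is a plain Python sketch of the same round trip. The player names and the probability vector are made up for illustration, and the helpers are stand-ins for pd.Categorical, to_categorical, and np.argmax rather than their real implementations.

```python
def encode_labels(names):
    """Map each distinct name to an integer code, using sorted order."""
    classes = sorted(set(names))
    index = {name: i for i, name in enumerate(classes)}
    return [index[n] for n in names], classes

def one_hot(codes, n_classes):
    """Turn each integer code into a vector with a single 1 in its slot."""
    return [[1 if i == code else 0 for i in range(n_classes)] for code in codes]

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])

throws = ["Steve", "Kate", "Susan", "Michael", "Steve"]  # hypothetical labels
codes, classes = encode_labels(throws)
y = one_hot(codes, len(classes))

fake_probabilities = [0.1, 0.2, 0.6, 0.1]  # what a softmax output might look like
winner = classes[argmax(fake_probabilities)]
```

The `codes` list is what sparse categorical cross-entropy expects, and `y` is what categorical cross-entropy expects.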
Multi-label is where people most often go wrong. Imagine a smart building deciding which of three systems, heating, cooling, and ventilation, to switch on, where any combination can be active at once. Each output neuron gets its own sigmoid so the three probabilities are independent and do not have to sum to one, and you round each at the 0.5 threshold to get a yes or no decision. Run this with a softmax by mistake and you force the model to pick exactly one system, which is wrong by the very definition of the problem.
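The difference is easy to see on raw numbers. In this sketch the three scores are invented, standing in for the values the output layer would produce before its activation for heating, cooling, and ventilation.

```python
import math

def sigmoid(z):
    """Squash one score into a probability, independently of the others."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(scores):
    """Turn scores into probabilities that compete and sum to one."""
    shifted = [s - max(scores) for s in scores]  # subtract the max for numerical stability
    exps = [math.exp(s) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, -1.0, 1.5]  # heating, cooling, ventilation (made-up values)

independent = [sigmoid(z) for z in scores]
decisions = [int(p > 0.5) for p in independent]  # heating and ventilation both switch on

competing = softmax(scores)  # at most one entry can exceed 0.5
```

With sigmoids, two systems switch on at once. With softmax, the same scores leave room for only one winner.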
Watching training, and stopping it well
The fit call returns an object whose history holds the metrics for every epoch, and plotting those curves is how you actually diagnose a model. Always pass validation data so you get both training and validation lines.
```python
h = model.fit(X_train, y_train, epochs=50, validation_data=(X_test, y_test))
```
The shape of those two curves tells the story. If both are still falling, you have undertrained and should train longer. If training loss keeps dropping while validation loss turns upward, you are overfitting, memorising the training set like a student who learned the practice answers but cannot handle rephrased questions. If both flatline high, the model is too simple. If both settle low, you have a good fit. One small practical note: newer versions of Keras name the accuracy metric accuracy and val_accuracy, where older ones used acc, so reach for the full name if you hit a key error.
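As a rough illustration, those four readings can be turned into a small function over the two loss lists, for example h.history['loss'] and h.history['val_loss']. The function name, the window, and the thresholds below are arbitrary assumptions chosen for the sketch, so treat it as a prompt for looking at the plot, not a replacement for it.

```python
def diagnose(train_loss, val_loss, window=3, tol=0.01, high=0.5):
    """Very rough reading of two loss curves.

    Each curve needs more than `window` epochs. `tol` decides what counts as
    flat and `high` what counts as a high loss. Both are arbitrary.
    """
    def trend(curve):
        # Change over the last `window` epochs; negative means still falling.
        return curve[-1] - curve[-1 - window]

    train_trend, val_trend = trend(train_loss), trend(val_loss)
    if train_trend < -tol and val_trend > tol:
        return "overfitting"
    if train_trend < -tol and val_trend < -tol:
        return "undertrained"
    if abs(train_trend) <= tol and abs(val_trend) <= tol:
        return "too simple" if min(train_loss[-1], val_loss[-1]) > high else "good fit"
    return "unclear"

verdict = diagnose(
    train_loss=[1.0, 0.8, 0.6, 0.4, 0.3, 0.2],
    val_loss=[1.0, 0.8, 0.7, 0.7, 0.8, 0.9],
)
```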
Rather than guessing the right number of epochs, let callbacks handle it. Early stopping watches a validation metric and halts when it stops improving for a set number of epochs, and a model checkpoint saves the best version seen along the way.
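```python
from keras.callbacks import EarlyStopping, ModelCheckpoint

# Stop once validation accuracy has not improved for 5 epochs, then roll back.
early_stop = EarlyStopping(monitor='val_accuracy', patience=5, restore_best_weights=True)
# Save only when the monitored metric improves (val_loss unless monitor is set).
checkpoint = ModelCheckpoint('best_model.keras', save_best_only=True)

model.fit(X_train, y_train, epochs=1_000_000,
          validation_data=(X_test, y_test),
          callbacks=[early_stop, checkpoint])
```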
These two work as a team. Early stopping is the coach who decides when to stop, the patience setting being how forgiving they are of short-term dips, and the checkpoint is the photographer capturing every new personal best to disk. Adding restore_best_weights=True is the detail that saves you from training fifty epochs and ending on a bad one, because it rolls the weights back to the best epoch. The absurd epoch count simply means you are letting the callback decide when to finish rather than imposing a ceiling.
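The patience rule itself is only a few lines of bookkeeping. The sketch below simulates it on an invented list of validation accuracies. It is a simplified model of the idea, not the Keras callback's actual code.

```python
def early_stopping(val_scores, patience=5):
    """Simulate patience-based stopping on a metric where higher is better.

    Returns (stopped_epoch, best_epoch), both zero-based.
    """
    best_score = float("-inf")
    best_epoch = 0
    waited = 0
    for epoch, score in enumerate(val_scores):
        if score > best_score:
            # New personal best: remember it and reset the patience counter.
            best_score, best_epoch, waited = score, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch
    return len(val_scores) - 1, best_epoch

val_accuracy = [0.60, 0.70, 0.75, 0.74, 0.76, 0.75, 0.74, 0.75, 0.73, 0.72, 0.71]
stopped_epoch, best_epoch = early_stopping(val_accuracy, patience=5)
```

Here training halts at epoch 9, five epochs after the best score at epoch 4. Rolling the weights back to that best epoch is what restore_best_weights=True is for.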
A related question is whether more data would even help. Train the same model on progressively larger slices, resetting the weights each time for a fair comparison, and plot test accuracy against dataset size. If the curve is still climbing at the largest slice, more data will help. If it has flattened, you have hit a plateau and need a better architecture or better features instead.
Tools that stabilise training
A few choices make training faster and steadier. The activation function is one, and the honest way to choose is to train otherwise-identical models that differ only in their activation and compare the validation curves, like testing different running shoes on the same course. ReLU is the default for hidden layers, Leaky ReLU rescues it when neurons die, and sigmoid and tanh saturate, so they are better kept to output layers or shallow nets.
Batch size is another. It controls how many examples the model sees before updating its weights. A batch of one updates after every example, responsive but noisy, like a chef re-seasoning after each bite. The whole dataset at once is stable but slow to converge. Most practitioners live at thirty-two or sixty-four and adjust for memory and behaviour.
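One way to feel the trade-off is to count how many weight updates a single epoch contains at each batch size. The dataset size of ten thousand below is a hypothetical figure.

```python
import math

def updates_per_epoch(n_examples, batch_size):
    """Weight updates in one pass over the data (the last batch may be partial)."""
    return math.ceil(n_examples / batch_size)

n_examples = 10_000  # hypothetical dataset size
per_epoch = {bs: updates_per_epoch(n_examples, bs) for bs in (1, 32, 64, n_examples)}
```

A batch of one gives ten thousand noisy updates per epoch, while the full dataset gives a single smooth one. Thirty-two and sixty-four sit in between with a few hundred.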
Batch normalization is the highest-leverage of the three. As values flow through a deep network they can grow or shrink wildly, forcing each layer to chase a moving target as the layer before it shifts.
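```python
from keras.layers import BatchNormalization, Dense
from keras.models import Sequential

model = Sequential()
model.add(Dense(50, input_shape=(64,), activation='relu'))
# Normalise the 50 activations before they reach the next layer.
model.add(BatchNormalization())
```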
A batch norm layer acts like a thermostat between layers, rescaling each layer’s outputs to roughly zero mean and unit variance so the next layer always sees inputs in a sensible range. In practice it lets you use higher learning rates, makes the network far less sensitive to how the weights were initialised, and even acts as a mild regulariser.
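The rescaling itself is simple arithmetic. This sketch normalises one feature across a batch of five made-up values, the way the layer does during training. The learned scale `gamma` and shift `beta` are left at their starting values of one and zero, and `eps` is a small constant that guards against division by zero. At inference time the real layer uses running averages instead of the current batch's statistics, which this sketch leaves out.

```python
import math

def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-3):
    """Normalise one feature across a batch, then apply scale and shift."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return [gamma * (v - mean) / math.sqrt(variance + eps) + beta for v in values]

activations = [120.0, 80.0, 100.0, 140.0, 60.0]  # made-up values that have drifted large
normalised = batch_norm(activations)

new_mean = sum(normalised) / len(normalised)
new_variance = sum((v - new_mean) ** 2 for v in normalised) / len(normalised)
```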
Tuning hyperparameters with sklearn
To search hyperparameters systematically, wrap a Keras model so sklearn’s tools can drive it. The pattern is a factory function that builds and compiles a model from whichever parameters you want to tune.
```python
from keras.layers import Dense
from keras.models import Sequential
from keras.optimizers import Adam

def create_model(learning_rate, activation):
    opt = Adam(learning_rate=learning_rate)
    model = Sequential()
    model.add(Dense(128, input_shape=(30,), activation=activation))
    model.add(Dense(256, activation=activation))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(optimizer=opt, loss='binary_crossentropy', metrics=['accuracy'])
    return model
```
You then hand that factory to a randomized search over a space of activations, batch sizes, epochs, and learning rates. Random search usually beats grid search here because a full grid explodes combinatorially, and randomly sampling the space tends to find a strong configuration after far fewer trials when not every hyperparameter matters equally. Two modern notes worth heeding: pass learning_rate to Adam rather than the deprecated lr, and reach for the third-party scikeras wrapper, since the old keras.wrappers.scikit_learn module is no longer shipped with current Keras releases.
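The combinatorial argument is easy to verify. The search space below is a hypothetical one. The sketch counts the full grid and then draws ten random configurations from it. Sampling here is done independently per parameter, so a configuration may repeat, which is a simplification compared with sklearn's randomized search.

```python
import itertools
import random

# Hypothetical search space; the values are illustrative.
search_space = {
    "activation": ["relu", "tanh"],
    "batch_size": [32, 64, 128],
    "epochs": [50, 100, 200],
    "learning_rate": [0.1, 0.01, 0.001],
}

# Full grid: every combination, 2 * 3 * 3 * 3 = 54 models to train.
grid = list(itertools.product(*search_space.values()))

def sample_config(space, rng):
    """Pick one value per hyperparameter, uniformly at random."""
    return {name: rng.choice(options) for name, options in space.items()}

rng = random.Random(42)  # arbitrary seed for repeatability
trials = [sample_config(search_space, rng) for _ in range(10)]
```

Ten trials against fifty-four is the saving. Add one more hyperparameter with three options and the grid triples while the random budget stays wherever you set it.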
Beyond plain dense networks
The same three-step rhythm carries you into more specialised architectures.
An autoencoder learns to compress and reconstruct its own input by forcing it through a small bottleneck layer. With 784 pixels squeezed into 32 numbers and then expanded back, the network has to decide what is essential enough to keep, and whatever it preserves is what it has judged important. That makes autoencoders natural denoisers, since a 32-number bottleneck has no room to store random noise, and useful for anomaly detection, where a high reconstruction error flags an unusual sample.
Convolutional networks are how you handle images without millions of parameters. Instead of connecting every pixel to every neuron, a Conv2D layer slides a small window across the image and learns to detect local patterns like edges and curves, applying the same detector everywhere.
```python
from keras.layers import Conv2D, Dense, Flatten
from keras.models import Sequential

model = Sequential()
model.add(Conv2D(32, kernel_size=3, input_shape=(28, 28, 1), activation='relu'))
model.add(Conv2D(16, kernel_size=3, activation='relu'))
model.add(Flatten())
model.add(Dense(10, activation='softmax'))
```
Thirty-two filters means thirty-two different pattern detectors examining the image at once, and Flatten unrolls their two-dimensional results into a vector the dense classifier can read. You can even build a mini-model that stops after the first convolution to visualise what each filter learned, and you will see early filters lighting up on edges while deeper ones respond to parts of objects.
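It helps to trace the shapes and parameter counts through that small network by hand. The sketch below assumes the Conv2D defaults of stride one and no padding, and the helper names are illustrative.

```python
def conv2d_output(height, width, kernel_size):
    """Spatial size after a stride-1 convolution with no padding."""
    return height - kernel_size + 1, width - kernel_size + 1

def conv2d_params(kernel_size, in_channels, filters):
    """Each filter has kernel_size * kernel_size * in_channels weights plus one bias."""
    return kernel_size * kernel_size * in_channels * filters + filters

h1, w1 = conv2d_output(28, 28, 3)      # 26 x 26 after the first convolution
p1 = conv2d_params(3, 1, 32)           # 3*3*1*32 + 32 = 320

h2, w2 = conv2d_output(h1, w1, 3)      # 24 x 24 after the second
p2 = conv2d_params(3, 32, 16)          # 3*3*32*16 + 16 = 4624

flat = h2 * w2 * 16                    # Flatten: 24*24*16 = 9216 values
p_dense = flat * 10 + 10               # 92170 for the softmax classifier
total = p1 + p2 + p_dense              # 97114 parameters overall
```

The two convolutional layers together use fewer than five thousand parameters. Almost all of the total sits in the final dense layer, which is why weight sharing makes convolutions so economical.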
Transfer learning is the shortcut you should almost always consider. A model like ResNet50, pretrained on over a million ImageNet photos, hands you weeks of learning for free.
```python
from tensorflow.keras.applications.resnet50 import preprocess_input, ResNet50, decode_predictions

model = ResNet50(weights='imagenet')
# img_ready: a batch of 224x224 RGB images already passed through preprocess_input.
preds = model.predict(img_ready)
print(decode_predictions(preds, top=3)[0])
```
The one rule that matters is preprocessing. Each pretrained family expects images prepared in a precise way, the right size and the right pixel normalisation, so you must use the exact preprocess_input bundled with the model. Use the wrong one and accuracy quietly collapses. Depending on your task you might use the pretrained model for inference only, freeze it and train a new classifier on top, or unfreeze the top layers and fine-tune with a tiny learning rate.
Finally, LSTMs handle text by reading a sequence while maintaining a running memory. To predict the next word you slice the text into overlapping windows, three words as input and the fourth as the target, and convert words to integers with a tokenizer.
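The windowing step can be sketched in plain Python. The sentence is a placeholder, and the word-to-integer mapping here numbers words by first appearance starting at one, which is a simplification. The Keras tokenizer assigns indices by word frequency instead.

```python
def make_windows(words, window=3):
    """Slice a word list into overlapping (input, target) pairs of integer ids."""
    index = {}
    for word in words:
        # Start ids at 1; 0 is commonly reserved for padding.
        index.setdefault(word, len(index) + 1)
    ids = [index[word] for word in words]
    inputs = [ids[i:i + window] for i in range(len(ids) - window)]
    targets = [ids[i + window] for i in range(len(ids) - window)]
    return inputs, targets, index

text = "the quick brown fox jumps over the lazy dog"  # placeholder corpus
X, y, word_index = make_windows(text.split(), window=3)
vocab_size = len(word_index) + 1  # +1 accounts for the reserved index 0
```

Each row of `X` holds three word ids and the matching entry of `y` is the id of the word that follows them.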
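```python
from keras.layers import Embedding, LSTM, Dense
from keras.models import Sequential

model = Sequential()
# input_length=3 matches the three-word windows; newer Keras releases may warn that it is deprecated.
model.add(Embedding(input_dim=vocab_size, input_length=3, output_dim=8))
model.add(LSTM(32))
model.add(Dense(32, activation='relu'))
model.add(Dense(vocab_size, activation='softmax'))
```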
An embedding layer turns each word index into a learned vector where similar words sit close together, which a raw integer could never express. The LSTM reads those word vectors one at a time, keeping a memory of what it has seen, and produces a summary that the final softmax turns into a probability over the whole vocabulary. One thing to watch when generating text: always taking the single most likely word with argmax tends to get stuck in loops on small corpora, so sampling from the probability distribution, or top-k sampling, gives livelier output.
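Here is a minimal sketch of the two decoding strategies, working on an invented probability vector in place of a real model output. The function names, the seed, and the choice of k equal to three are assumptions for illustration.

```python
import random

def greedy(probs):
    """Always pick the single most likely index."""
    return max(range(len(probs)), key=lambda i: probs[i])

def sample_top_k(probs, k, rng):
    """Keep the k most likely indices and sample among them by probability."""
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    weights = [probs[i] for i in top]  # rng.choices normalises these for us
    return rng.choices(top, weights=weights, k=1)[0]

probs = [0.05, 0.40, 0.30, 0.20, 0.05]  # made-up softmax output over five words
rng = random.Random(0)                  # arbitrary seed for repeatability

always_same = greedy(probs)
picks = [sample_top_k(probs, k=3, rng=rng) for _ in range(200)]
```

Greedy decoding returns index 1 every time, while top-k sampling spreads its choices over indices 1, 2, and 3 and never picks the two unlikely words.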
The thread through all of it
Every model in this article, however different on the surface, is the same three steps wearing different clothes. Build by stacking the layers the data calls for, dense for tabular, convolutional for images, recurrent for sequences. Compile with the loss and output activation that match the task, which is the decision most worth getting right. Fit, then watch the learning curves and let callbacks decide when to stop. Learn that rhythm once and Keras stops being a catalogue of special cases and becomes a single, flexible habit.
See you soon.