Multi-Modal Models with Hugging Face

This article explores the capabilities of multi-modal machine learning using Hugging Face, detailing methods for data processing, model inference, and practical applications across text, images, audio, and video.

Most machine learning tutorials live comfortably inside one data type. You classify text, or you classify images, and the two worlds never meet. Real applications are messier than that. A video ad has frames and a soundtrack. A scanned invoice has words and layout. A chat message might contain a photo and a question about it. Multi-modal models handle exactly these situations, and the Hugging Face ecosystem has quietly become the easiest place to work with them.

This article walks through the full stack: finding models programmatically on the Hub, preprocessing text, images, and audio, running inference with pipelines and Auto classes, and then moving into genuinely multi-modal territory with CLIP, vision-language models, voice conversion, image editing, and video generation.

Searching the Hub Like a Database

The Hugging Face Hub hosts hundreds of thousands of models, and clicking through the website stops scaling quickly. The HfApi class lets you query the Hub from code instead.

from huggingface_hub import HfApi
api = HfApi()
models = api.list_models(task="text-to-image")
print(f"Task: text-to-image, Models: {len(list(models))}")

Think of the Hub as a giant app store for AI models and HfApi as your programmatic search bar. One detail trips people up: list_models returns a lazy generator rather than a full list, so you have to wrap it in list() before counting. It is the difference between a firehose and a bucket; you can only measure the water once it is in the bucket.
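
If you only want a peek at the first few results, you do not have to drain the whole generator. The sketch below uses a stand-in generator, fake_list_models (a hypothetical name invented here, not part of huggingface_hub), so it runs without network access; the same islice call should work on whatever iterable list_models returns.

```python
from itertools import islice

def fake_list_models(total):
    """Hypothetical stand-in for a lazy Hub query: yields IDs one at a time."""
    for i in range(total):
        yield f"org/model-{i}"

models = fake_list_models(1_000_000)

# Take only the first three items; the remaining ones are never produced.
first_three = list(islice(models, 3))
print(first_three)

# A generator has no len(); items must be consumed to be counted.
remaining = sum(1 for _ in islice(models, 10))
print(remaining)
```

list_models also accepts a limit argument, which caps the number of results on the request side and is usually the simpler choice when you know the cap in advance.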

The search gets genuinely useful once you stack filters. Here is how you would find the most popular Stable Diffusion model from a specific organization and load it in one motion.

from diffusers import StableDiffusionPipeline
models = api.list_models(
    task="text-to-image",
    author="CompVis",
    tags="diffusers:StableDiffusionPipeline",
    sort="likes"
)
models = list(models)
pipe = StableDiffusionPipeline.from_pretrained(models[0].id)

This works like narrowing a restaurant search from “everything in the city” down to “Italian, in my neighborhood, four stars and up, sorted by review count.” The task filter restricts results to image generation, author pins the search to a known organization as a trust signal, the tags filter keeps only models tagged for the Stable Diffusion pipeline class, and sort="likes" puts community favorites first. models[0].id then hands you the repo ID string of the winner, which from_pretrained downloads and assembles into a runnable pipeline.

One Preprocessing Pattern, Three Modalities

Whatever the data type, preprocessing in the Hugging Face world follows the same shape: load a processor, call it on raw data, get back tensors. Only the processor class changes.

Text goes through a tokenizer. Language models cannot read words; they only understand numbers, so the tokenizer splits each sentence into sub-word pieces (a word like “unbelievable” might become “un”, “##believ”, “##able”, depending on the vocabulary), looks each piece up in a fixed vocabulary, and returns integer IDs.

from transformers import AutoTokenizer
caption = image_data[5]["caption"][0]
tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")
encoded_input = tokenizer(caption, return_tensors="pt")

The return_tensors="pt" argument wraps the ID lists in PyTorch tensors so the model can consume them directly, and the output dictionary includes both the input_ids and an attention_mask telling the model which positions are real tokens versus padding.
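
To make the sub-word splitting concrete, here is a minimal sketch of greedy longest-match splitting in the WordPiece style. The five-entry vocabulary is invented for illustration (DistilBERT's real vocabulary has roughly 30,000 entries), and the real tokenizer also handles lowercasing, punctuation, and special tokens that this sketch ignores.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Toy vocabulary, invented for this example.
vocab = {"[UNK]": 0, "un": 1, "##believ": 2, "##able": 3, "the": 4}

pieces = wordpiece("unbelievable", vocab)
ids = [vocab[p] for p in pieces]
print(pieces)
print(ids)
```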

Images go through an image processor. Raw photos arrive in every shape, size, and pixel scale imaginable, and the processor is a standardization station: it resizes each image to the resolution the model was trained on, rescales pixel values from the 0 to 255 range into normalized floats, and packs everything into a tensor. Here is the full loop for generating a caption with BLIP.

from transformers import BlipProcessor, BlipForConditionalGeneration
photo = image_data[5]["image"]
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
inputs = processor(images=photo, return_tensors="pt")
output = model.generate(**inputs)
print(f"Generated caption: {processor.decode(output[0])}")

model.generate reads the standardized pixels and writes a caption one token at a time, exactly the way a text model generates text, except the prompt is an image instead of words. processor.decode then translates the generated token IDs back into a readable string.

Audio adds one extra wrinkle: sampling rate. Whisper was trained exclusively on audio sampled at 16,000 samples per second. Feed it CD-quality audio at 44,100 Hz without resampling and the model effectively hears everything at the wrong speed: the samples are read as if they were 16 kHz, so the clip comes out stretched and low-pitched, like a record played too slowly.

from datasets import Audio
from transformers import AutoProcessor
speech_data = speech_data.cast_column("audio", Audio(sampling_rate=16000))
processor = AutoProcessor.from_pretrained("openai/whisper-small")
audio_features = processor(
    speech_data[0]["audio"]["array"],
    sampling_rate=16000,
    padding=True,
    return_tensors="pt"
)

cast_column is elegant here: it does not resample the whole dataset upfront, it resamples lazily whenever you read a row. The processor then converts the waveform into a log-mel spectrogram, essentially a heatmap showing which frequencies are loud at which moments, because that visual representation is what Whisper’s encoder actually reads. padding=True brings variable-length clips up to a common length per batch.
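
The sampling-rate mismatch is easy to quantify with plain arithmetic. The sketch below is only bookkeeping on sample counts, not real resampling (which also filters the signal); the two-second clip length is an arbitrary choice for the example.

```python
def apparent_duration(num_samples, assumed_rate):
    """Seconds of audio a consumer perceives if it assumes `assumed_rate` Hz."""
    return num_samples / assumed_rate

recorded_rate = 44_100   # CD-quality recording
model_rate = 16_000      # rate Whisper expects
true_seconds = 2.0
num_samples = int(true_seconds * recorded_rate)

# Without resampling, the same samples are read as if they were 16 kHz.
misread = apparent_duration(num_samples, model_rate)

# Resampling changes the sample count so the duration is preserved.
resampled_count = round(num_samples * model_rate / recorded_rate)
correct = apparent_duration(resampled_count, model_rate)

print(f"misread as {misread:.2f} s, after resampling {correct:.2f} s")
```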

Pipelines: The High-Level Shortcut

Everything in the previous section, loading the processor, transforming the data, running the model, decoding the output, collapses into a single call with pipeline.

from transformers import pipeline
captioner = pipeline(task="image-to-text", model="Salesforce/blip-image-captioning-base")
prediction = captioner(image_data[3]["image"])
print(prediction)

It is the difference between cooking from scratch and pressing a button on a microwave: same result, far less work, less control. When you do need control over the generation itself, pipelines accept a generate_kwargs dictionary that gets forwarded straight to the underlying model.generate() call. Here is MusicGen producing audio from a text description.

import soundfile as sf
music_pipe = pipeline(task="text-to-audio", model="facebook/musicgen-small", framework="pt")
generate_kwargs = {"temperature": 0.8, "max_new_tokens": 256}
outputs = music_pipe("Classic rock riff", generate_kwargs=generate_kwargs)
sf.write("riff.wav", outputs["audio"][0][0], outputs["sampling_rate"])

MusicGen generates audio token by token, much like a text model generates words. The temperature setting controls creativity: at 1.0 the model samples freely, and below that it grows more conservative and predictable. sf.write takes the raw audio array and saves it as a playable WAV file.
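
A rough sketch of what temperature does to the sampling distribution, using made-up scores for three candidate tokens. This is the standard softmax-with-temperature formula, not MusicGen's actual decoding loop, which layers more on top.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into sampling probabilities, sharpened or flattened by temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for t in (1.0, 0.8, 0.3):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

As the temperature drops, more of the probability mass piles onto the top-scoring token, which is why low temperatures feel conservative.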

The rule of thumb for choosing between pipelines and Auto classes: reach for the pipeline when you want a single inference call and the default post-processing is fine. Drop down to Auto classes when you need raw logits or embeddings, when you are fine-tuning, or when you are chaining models together.

Evaluating on Your Own Data

Accuracy alone is a trap when classes are imbalanced. A model that always predicts the majority class scores high accuracy while being completely useless, and only precision and recall expose that. The evaluate library ships a task-aware evaluator that knows how to quiz an image classifier properly.

import evaluate
from evaluate import evaluator
task_evaluator = evaluator("image-classification")
task_evaluator.METRIC_KWARGS = {"average": "weighted"}
label_map = classifier_pipe.model.config.label2id
eval_results = task_evaluator.compute(
    model_or_pipeline=classifier_pipe,
    data=product_images,
    metric=evaluate.combine(metrics_dict),
    label_mapping=label_map
)
print(f"Precision: {eval_results['precision']:.2f}, Recall: {eval_results['recall']:.2f}")

The label_mapping step matters because your dataset stores labels as integers while the pipeline outputs strings like “cat”; the mapping bridges that gap. Weighted averaging scores each class in proportion to how often it appears, so rare classes do not get unfairly amplified or buried.
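
The accuracy trap is easy to reproduce by hand. The sketch below uses an invented 95/5 split between two product labels and a model that always predicts the majority class; the helper function is written from the textbook definitions rather than taken from the evaluate library.

```python
def precision_recall(y_true, y_pred, label):
    """Precision and recall for one class; 0.0 when the denominator is empty."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# 95 "ok" products and 5 "defect" products; the model always answers "ok".
y_true = ["ok"] * 95 + ["defect"] * 5
y_pred = ["ok"] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
defect_precision, defect_recall = precision_recall(y_true, y_pred, "defect")

# Weighted recall: per-class recall weighted by how often each class occurs.
weighted_recall = sum(
    precision_recall(y_true, y_pred, c)[1] * y_true.count(c) / len(y_true)
    for c in set(y_true)
)
print(accuracy, defect_recall, weighted_recall)
```

Accuracy comes out at 0.95 while recall on the defect class is zero. Note that weighted recall lands on the same 0.95 as accuracy, so when the rare class is the one you care about, the per-class numbers are worth inspecting alongside the weighted average.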

Vision Tasks: From Coarse to Fine

Image classification, object detection, and segmentation form a natural progression. Classification answers “what is the main thing in this photo” with one label for the whole image. Detection answers “where are all the things” with a bounding box per object. Segmentation goes down to the pixel level, which is what lets you cut a subject out of its background cleanly. Each is one pipeline swap away from the others.

import matplotlib.pyplot as plt
import matplotlib.patches as patches
# Classification: one label for the whole image
classifier = pipeline(task="image-classification", model="google/mobilenet_v2_1.0_224")
pred = classifier(photo)
print("Predicted class:", pred[0]["label"])
# Detection: one bounding box per object, drawn on top of the photo
detector = pipeline("object-detection", "facebook/detr-resnet-50", revision="no_timm")
detections = detector(photo)
fig, ax = plt.subplots()
ax.imshow(photo)
colors = plt.cm.tab10.colors  # ten distinct colors, cycled if there are more boxes
for n, obj in enumerate(detections):
    box = obj["box"]
    rect = patches.Rectangle(
        (box["xmin"], box["ymin"]),
        box["xmax"] - box["xmin"], box["ymax"] - box["ymin"],
        linewidth=1, edgecolor=colors[n % len(colors)], facecolor="none"
    )
    ax.add_patch(rect)
plt.show()
# Segmentation: a pixel-level cutout of the subject
remover = pipeline(task="image-segmentation", model="briaai/RMBG-1.4", trust_remote_code=True)
cutout = remover(photo)
plt.imshow(cutout)
plt.show()

The detector returns a list of dictionaries, each holding a label, a confidence score, and pixel coordinates for the bounding box, which map directly onto matplotlib rectangles for visualization. Note the trust_remote_code=True flag on the segmentation model: some models ship custom Python code alongside their weights, and you have to opt in explicitly before that code runs on your machine.

Fine-tuning a vision model means changing the head

MobileNetV2 was originally trained to distinguish 1,000 ImageNet categories. Your dataset almost certainly has fewer. Fine-tuning surgically replaces the final classification layer, the “head,” with a new one sized for your label count, while keeping all the visual feature-detection layers intact.

from transformers import AutoModelForImageClassification
labels = product_images["train"].features["label"].names
label2id = {label: str(i) for i, label in enumerate(labels)}
id2label = {str(i): label for i, label in enumerate(labels)}
model = AutoModelForImageClassification.from_pretrained(
    "google/mobilenet_v2_1.0_224",
    num_labels=len(labels),
    id2label=id2label,
    label2id=label2id,
    ignore_mismatched_sizes=True
)

The ignore_mismatched_sizes=True flag is the permission slip that says “yes, I know the new head is a different shape than the saved one, load it anyway.” The two dictionaries give the model a way to convert between index numbers and human-readable label names in both directions.

Before training, split the data and attach transforms lazily.

data_splits = product_images.train_test_split(test_size=0.2, seed=42)
transformed = data_splits.with_transform(transforms)
plt.imshow(transformed["train"][0]["pixel_values"].permute(1, 2, 0))

with_transform is like putting a filter on a camera lens rather than editing every photo in advance: the augmentation only runs at the moment you access a row, so you never hold a second transformed copy of the dataset in memory. The permute(1, 2, 0) call reorders axes because PyTorch stores images as channels-height-width while matplotlib expects height-width-channels.

Training itself runs through the familiar Trainer.

from transformers import Trainer, TrainingArguments
training_args = TrainingArguments(
    output_dir="product_classifier",
    learning_rate=6e-5,
    gradient_accumulation_steps=4,
    num_train_epochs=3,
    push_to_hub=False
)
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=transformed["train"],
    eval_dataset=transformed["test"],
    processing_class=image_processor,
    compute_metrics=compute_metrics,
)

The gradient_accumulation_steps=4 setting is a budget trick worth knowing. If your GPU can only fit a batch of 8 images, accumulation lets you process four mini-batches before each weight update, effectively simulating a batch of 32 without needing more memory. The data collator is the conveyor belt worker that stacks individual dataset rows into uniform batches.
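
Why accumulation matches a larger batch can be checked on a toy problem. The sketch below uses a one-parameter model y = w * x with invented data and a mean-squared-error loss; it assumes equally sized micro-batches and averages their gradients, which is a simplification of what Trainer does internally.

```python
def mse_gradient(w, batch):
    """Gradient of mean squared error for the one-parameter model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

# 32 made-up (x, y) pairs that roughly follow y = 3x.
data = [(x / 10, 3 * x / 10 + 0.01 * (x % 5)) for x in range(1, 33)]
w = 0.5

# One large batch of 32.
full_grad = mse_gradient(w, data)

# Four micro-batches of 8: average the four gradients, then update once.
micro_batches = [data[i:i + 8] for i in range(0, 32, 8)]
accumulated = sum(mse_gradient(w, mb) for mb in micro_batches) / len(micro_batches)

print(full_grad, accumulated)
```

The two gradients agree up to floating-point rounding, while only eight examples ever need to sit in memory at once.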

Audio Tasks: Transcription and Voice

Whisper handles speech recognition as an encoder-decoder problem: the encoder reads the spectrogram and the decoder writes the transcript one token at a time.

from transformers import WhisperForConditionalGeneration
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")
model.config.forced_decoder_ids = None
features = processor(
    clip["array"], sampling_rate=16000,
    return_tensors="pt", return_attention_mask=True
)
predicted_ids = model.generate(features.input_features)
transcription = processor.decode(predicted_ids[0], skip_special_tokens=True)
print(transcription)

Setting forced_decoder_ids = None removes the hardcoded language and task prefix, letting Whisper auto-detect the spoken language instead of assuming English. skip_special_tokens=True strips internal bookkeeping tokens like the transcript-start marker so you get clean text.

Voice work gets more interesting. A speaker embedding is a voiceprint: a compact vector capturing what makes a person’s voice sound like them, independent of the actual words spoken.

import torch
def create_speaker_embedding(waveform):
    with torch.no_grad():
        embedding = speaker_model.encode_batch(torch.tensor(waveform))
        embedding = torch.nn.functional.normalize(embedding, dim=2)
        embedding = embedding.squeeze().cpu().numpy()
    return embedding
voice_sample = speech_data[10]["audio"]["array"]
speaker_embedding = create_speaker_embedding(voice_sample)

Normalizing every embedding to unit length puts all the voiceprints on the same sphere, so you can compare two voices by measuring the angle between their vectors regardless of how loud either recording was. That embedding then powers voice conversion: same script, different actor.
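
Here is a small sketch of that comparison using plain Python lists. The four-dimensional "voiceprints" are invented for illustration (real speaker embeddings are much longer), and a scaled copy of one vector stands in for the same voice captured at a different magnitude.

```python
import math

def normalize(vec):
    """Scale a vector to unit length (L2 norm of 1)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

def cosine_similarity(a, b):
    """Dot product of two unit vectors, which equals the cosine of their angle."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))

# Made-up embeddings for illustration only.
alice_quiet = [0.2, 0.1, 0.4, 0.3]
alice_loud = [v * 5 for v in alice_quiet]  # same direction, larger magnitude
bob = [0.4, -0.3, 0.1, 0.2]

same_speaker = cosine_similarity(alice_quiet, alice_loud)
different_speaker = cosine_similarity(alice_quiet, bob)
print(round(same_speaker, 3), round(different_speaker, 3))
```

After normalization the scaled copy is indistinguishable from the original, while the second speaker sits at a clearly different angle.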

from transformers import SpeechT5ForSpeechToSpeech
import soundfile as sf
vc_model = SpeechT5ForSpeechToSpeech.from_pretrained("microsoft/speecht5_vc")
inputs = processor(audio=source_speech, sampling_rate=16000, return_tensors="pt")
speech = vc_model.generate_speech(inputs["input_values"], speaker_embedding, vocoder=vocoder)
sf.write("converted.wav", speech.numpy(), samplerate=16000)

The source audio provides what was said, the speaker embedding provides whose voice to say it in, and the vocoder is the neural synthesizer that turns the model’s internal spectrogram representation back into an actual playable waveform.

Fine-tuning text-to-speech follows the same Trainer pattern with a sequence-to-sequence variant, and two of its arguments deserve a sentence each. warmup_steps=500 starts the learning rate near zero and ramps it up gradually, easing the car out of the parking space rather than flooring it, which avoids chaotic early updates. max_steps=4000 stops training by step count rather than waiting for full epochs, useful for time-boxing a run on a large dataset.
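
The shape of that schedule can be sketched in a few lines. This assumes the linear scheduler that Trainer uses by default (linear warmup, then linear decay to zero); the peak learning rate of 1e-5 is an assumed value for illustration, not one taken from the article.

```python
def learning_rate_at(step, peak_lr, warmup_steps, max_steps):
    """Linear warmup to peak_lr, then linear decay to zero at max_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(0, max_steps - step)
    return peak_lr * remaining / (max_steps - warmup_steps)

peak_lr = 1e-5  # assumed value for the example
for step in (0, 250, 500, 2250, 4000):
    print(step, learning_rate_at(step, peak_lr, warmup_steps=500, max_steps=4000))
```

The rate starts at zero, reaches its peak exactly at step 500, and has fallen back to half the peak by step 2250.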

CLIP: Where Modalities Meet

CLIP is the conceptual heart of multi-modal ML. It was trained on roughly 400 million image-caption pairs, learning to pull matching pairs together in a shared vector space while pushing mismatched pairs apart. The payoff is zero-shot classification: you describe your categories in plain English, and CLIP measures which description sits nearest to the image. No classifier training required.

inputs = processor(
    text=categories, images=image_data[999]["image"],
    return_tensors="pt", padding=True
)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=1)
category = categories[probs.argmax().item()]
print(f"Predicted category: {category}")

The processor handles both modalities in one call, tokenizing the candidate labels and preprocessing the image together. logits_per_image holds the similarity between the image and each label, and the softmax converts those raw similarities into a probability distribution.
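
Stripped of the neural networks, the scoring step is a softmax over scaled cosine similarities. The sketch below uses invented three-dimensional vectors in place of real CLIP embeddings, and the logit_scale of 100 is an assumption that roughly matches the released checkpoints.

```python
import math

def unit(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

def zero_shot_probs(image_vec, text_vecs, logit_scale=100.0):
    """Softmax over scaled cosine similarities between one image and each label."""
    image_vec = unit(image_vec)
    logits = [
        logit_scale * sum(i * t for i, t in zip(image_vec, unit(text_vec)))
        for text_vec in text_vecs
    ]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up embeddings; real CLIP vectors have hundreds of dimensions.
categories = ["a photo of a dog", "a photo of a car", "a photo of a pizza"]
text_vecs = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.1], [0.1, 0.0, 0.9]]
image_vec = [0.8, 0.2, 0.1]

probs = zero_shot_probs(image_vec, text_vecs)
best = categories[probs.index(max(probs))]
print(best, [round(p, 3) for p in probs])
```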

The same shared space yields a free quality metric. CLIP score asks “how well does this caption describe this image” without needing any human reference.

from torchvision.transforms import ToTensor
from torchmetrics.functional.multimodal import clip_score
image_tensor = ToTensor()(photo) * 255
score = clip_score(image_tensor, description, "openai/clip-vit-base-patch32")
print(f"CLIP score: {score}")

One gotcha hides in the first line: ToTensor rescales pixels into the 0 to 1 range, but the scoring function expects the original 0 to 255 values, so you multiply back by 255 to undo the scaling before passing the tensor in.

Vision-Language Models and Chat Templates

Modern vision-language models accept chat-style messages where images and text are both first-class content, like attaching a photo to a chat message and asking a question about it.

article_body = news_data[6]["content"]
question = (
    f"Does this news article have a positive, negative, or neutral impact "
    f"on the team's championship chances: {article_body}. Provide reasoning."
)
chat_template = [
    {"role": "user", "content": [
        {"type": "image", "image": photo},
        {"type": "text", "text": question},
    ]}
]

Turning that structured message into model output takes a few steps, each with a clear job.

# process_vision_info is assumed here to be the helper from the qwen_vl_utils package
text = vl_processor.apply_chat_template(
    chat_template, tokenize=False, add_generation_prompt=True
)
image_inputs, _ = process_vision_info(chat_template)
inputs = vl_processor(text=[text], images=image_inputs, padding=True, return_tensors="pt")
generated_ids = vl_model.generate(**inputs, max_new_tokens=500)
trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
answer = vl_processor.batch_decode(trimmed, skip_special_tokens=True)
print(answer[0])

apply_chat_template renders the list of message dictionaries into the flat prompt string this particular model expects, since different VLMs use different separator tokens and layouts. add_generation_prompt=True appends the assistant’s turn-start token, the signal that says “now it is your turn to respond.” The trimming step exists because model.generate returns the full sequence, prompt included, and you only want the newly generated part, so you slice off the first len(in_ids) tokens of each output.
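
The trimming step is easier to see with plain lists. The token IDs below are invented, with 0 standing in for a padding token so that both prompts in the batch share the same length, as they would after padding=True.

```python
# Made-up token IDs: two prompts padded to the same length (0 is the pad ID here).
input_ids = [[101, 5, 6], [0, 101, 7]]

# generate() returns prompt + continuation for each row.
generated_ids = [[101, 5, 6, 40, 41], [0, 101, 7, 50, 51]]

# Drop the first len(prompt) tokens of each row to keep only the new tokens.
trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(input_ids, generated_ids)
]
print(trimmed)
```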

Late Fusion: Scoring a Video with Two Models

Here is a genuinely multi-modal workflow. Suppose you want to score the emotional tone of a video ad. A video is really two streams duct-taped together, moving images and an audio track, so the first step is pulling them apart.

from moviepy.video.io.ffmpeg_tools import ffmpeg_extract_subclip
from moviepy.video.io.VideoFileClip import VideoFileClip
# The output path is passed positionally because its keyword name differs between moviepy 1.x (targetname) and 2.x (outputfile)
ffmpeg_extract_subclip("trainer_ad.mp4", 0, 5, "trainer_ad_5s.mp4")
video = VideoFileClip("trainer_ad_5s.mp4")
audio = video.audio
audio.write_audiofile("trainer_ad_5s.mp3")

Then you run two completely separate zero-shot classifiers, CLIP for what the video looks like and CLAP (its audio counterpart) for what it sounds like, and average their scores at the end. This strategy is called late fusion: ask two experts for their independent opinions, then split the difference.

audio_classifier = pipeline(model="laion/clap-htsat-unfused",
                            task="zero-shot-audio-classification")
image_classifier = pipeline(model="openai/clip-vit-base-patch32",
                            task="zero-shot-image-classification")
frame_predictions = image_classifier(video_frames, candidate_labels=emotions)
frame_scores = [{l["label"]: l["score"] for l in p} for p in frame_predictions]
avg_image_scores = {e: sum(s[e] for s in frame_scores) / len(frame_scores) for e in emotions}
audio_scores = audio_classifier(audio_sample, candidate_labels=emotions)
audio_scores = {l["label"]: l["score"] for l in audio_scores}
multimodal_scores = {
    e: (avg_image_scores[e] + audio_scores[e]) / 2 for e in emotions
}
print(f"Multimodal scores: {multimodal_scores}")

CLIP runs on every frame and the per-emotion confidences get averaged across time, giving the visual channel’s verdict. CLAP gives the audio channel’s verdict in one pass. The final score is the arithmetic mean of both. Because both classifiers are zero-shot, the emotion labels are just a Python list you define on the spot; no retraining anywhere.
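
With the models taken out, the fusion arithmetic looks like this. All scores below are invented for illustration; in the real workflow they would come from the two classifiers.

```python
emotions = ["joy", "fear", "sadness"]

# Made-up per-frame scores from an image classifier (three frames).
frame_scores = [
    {"joy": 0.7, "fear": 0.2, "sadness": 0.1},
    {"joy": 0.5, "fear": 0.3, "sadness": 0.2},
    {"joy": 0.6, "fear": 0.1, "sadness": 0.3},
]
# Made-up scores from an audio classifier for the whole clip.
audio_scores = {"joy": 0.4, "fear": 0.5, "sadness": 0.1}

# Step 1: average the visual scores over time.
avg_image_scores = {
    e: sum(f[e] for f in frame_scores) / len(frame_scores) for e in emotions
}
# Step 2: late fusion, an unweighted mean of the two modalities.
fused = {e: (avg_image_scores[e] + audio_scores[e]) / 2 for e in emotions}
top_emotion = max(fused, key=fused.get)
print(top_emotion, {e: round(s, 3) for e, s in fused.items()})
```

Here the audio channel alone would have picked fear, but the visual evidence outweighs it once the two are averaged. A weighted mean is a natural variation if you trust one modality more than the other.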

Asking Questions About Images and Documents

Visual question answering with ViLT jointly encodes the image and the question in a single pass, unlike early systems that processed the two separately and merged late.

question = "What color is the traffic light?"
encoding = processor(photo, question, return_tensors="pt")
outputs = model(**encoding)
idx = outputs.logits.argmax(-1).item()
print("Predicted answer:", model.config.id2label[idx])

The model scores every answer in its fixed vocabulary, argmax picks the winner, and id2label translates the index back into a word like “red.”

Document QA goes a step further by understanding layout, not just text. LayoutLM was trained on OCR text combined with position coordinates, so it knows the difference between a table cell, a heading, and a footnote.

doc_qa = pipeline(task="document-question-answering", model="impira/layoutlm-document-qa")
result = doc_qa(
    report_pages["test"][61]["image"],
    "How many days of formal training were provided to employees in 2012-2013?"
)
print(result)

It is like asking a question to someone reading a scanned form: they do not just read the words, they understand where on the page each piece of information lives.

Controlled Image Generation and Editing

Stable Diffusion generates freely from text. ControlNet bolts on a spatial constraint: extract the edge skeleton of a source image using Canny edge detection, and the model must generate something that respects those outlines while following your prompt. It is like handing an artist a pencil sketch and saying “paint this, but make it look like Snoopy.”

import torch
import matplotlib.pyplot as plt
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
# canny_image (the edge map) and generator (a seeded torch.Generator) are assumed to be prepared beforehand
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
)
pipe = pipe.to("cuda")
output = pipe(
    ["Snoopy, best quality, extremely detailed"],
    canny_image,
    negative_prompt=["monochrome, lowres, bad anatomy, worst quality, low quality"],
    generator=generator,
    num_inference_steps=20,
)
plt.imshow(output.images[0])
plt.show()

A few practical notes packed into this snippet: torch_dtype=torch.float16 runs everything in half precision for speed and lower VRAM, the negative_prompt acts as an anti-prompt steering generation away from listed qualities, the pre-seeded generator makes results reproducible, and more inference steps mean higher fidelity at the cost of time.

Inpainting applies the same machinery to editing a single region of an existing image. The mask is a binary map where white pixels say “regenerate this area” and black pixels say “leave this alone.”

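# pipe here is assumed to be a StableDiffusionControlNetInpaintPipeline, which accepts image, mask_image, and control_image
output = pipe(
    "a black beard, best quality, extremely detailed",
    num_inference_steps=40, eta=1.0,
    image=init_image, mask_image=mask_image, control_image=control_image
)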

The model fills the masked region guided by the prompt and the ControlNet conditioning, blending the new content into the untouched surroundings. The eta=1.0 setting introduces stochasticity into the sampling schedule, allowing more creative variation inside the filled region; at zero, the process is deterministic.

Generating Video and Scoring It

Video diffusion extends image diffusion into the time dimension: instead of denoising one frame, the model denoises a stack of frames simultaneously, learning how motion and appearance evolve together.

import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video
prompt = "A robot doing the robot dance. The dance floor has colorful squares and a glitterball."
pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-2b", torch_dtype=torch.float16)
video = pipe(
    prompt=prompt,
    num_inference_steps=20,
    num_frames=20,
    guidance_scale=6
)
frames = video.frames[0]
video_path = export_to_video(frames, "robot_dance.mp4", fps=8)

guidance_scale is the dial between creativity and prompt faithfulness: higher values make the model cling more tightly to your description at the cost of visual variety. Twenty frames at 8 fps gives you roughly two and a half seconds of footage, and export_to_video stitches the PIL frames into an MP4.
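
Under the hood, diffusion pipelines typically implement this dial as classifier-free guidance: the model predicts noise once with the prompt and once without, and the final prediction is pushed along the difference. The sketch below applies that formula to invented numbers; it is a simplified illustration, not the CogVideoX implementation.

```python
def guided_prediction(unconditional, conditional, guidance_scale):
    """Classifier-free guidance: move along (conditional - unconditional)."""
    return [
        u + guidance_scale * (c - u) for u, c in zip(unconditional, conditional)
    ]

# Made-up noise predictions for three latent values.
unconditional = [0.10, -0.20, 0.05]  # model output with an empty prompt
conditional = [0.30, -0.10, 0.00]    # model output with the real prompt

for scale in (1, 6):
    guided = guided_prediction(unconditional, conditional, scale)
    print(scale, [round(v, 2) for v in guided])
```

At a scale of 1 the result is just the prompt-conditioned prediction; at 6 the prompt's influence is amplified sixfold, which is where the tighter adherence and the reduced variety both come from.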

Did the video actually match the prompt? Score every frame against it with CLIP and average.

from functools import partial
import numpy as np
import torch
from torchmetrics.functional.multimodal import clip_score
clip_score_fn = partial(clip_score, model_name_or_path="openai/clip-vit-base-patch32")
frame_tensors = []
for frame in frames:
    frame = np.array(frame)
    # PIL frames are already uint8 in 0-255; only float frames in 0-1 need rescaling
    if frame.dtype != np.uint8:
        frame = (frame * 255).astype("uint8")
    frame_tensor = torch.from_numpy(frame).permute(2, 0, 1)
    frame_tensors.append(frame_tensor)
scores = clip_score_fn(frame_tensors, [prompt] * len(frame_tensors)).detach().cpu().numpy()
avg_clip_score = round(float(np.mean(scores)), 4)
print(f"Average CLIP score: {avg_clip_score}")

Each frame is just a still image, so the same CLIP score logic from earlier applies. partial pre-fills the model argument so each call only needs images and texts, like presetting a radio station so you do not have to tune it every time. The permute(2, 0, 1) reorders each frame from height-width-channels into the channels-first layout the scoring function expects, and the final average gives one number summarizing how well the whole clip matches the description.

Choosing Your Path Through the Stack

When you only need inference with sensible default post-processing, pipeline(task=..., model=...) is the answer, whatever the modality. The moment you need raw logits or embeddings, want to chain models together, or plan to fine-tune, drop down to the processor and Auto class pair, and bring in Trainer or Seq2SeqTrainer for the training itself. Image and video generation live in the separate diffusers library, with StableDiffusionPipeline, ControlNet variants, and CogVideoXPipeline covering generation, constrained generation, and video respectively.

The deeper pattern worth internalizing is that every modality flows through the same three stages: a processor turns raw data into tensors, a model turns tensors into predictions or generations, and a decoder or post-processor turns the output back into something human-readable. Once that pattern clicks, a new modality is just a new processor class, and “multi-modal” stops being intimidating: it is two of these flows running side by side, meeting either in a shared embedding space like CLIP’s or in a simple average at the end.

See you soon.
