Working with Llama 3

Running a large language model like Llama 3 locally enhances privacy and control, allowing users to process data without external servers while managing inference costs effectively through tailored settings and structured prompts.

There is something quietly radical about running a large language model on your own laptop. No API key, no per-token billing, no sending your data to someone else’s servers. Llama 3 is Meta’s open-weight model, and unlike API-based models such as GPT-4 or Claude, you can download the weights and run inference entirely on your own machine. The tool that makes this practical in Python is llama-cpp-python, a wrapper around llama.cpp, a C++ library that runs Llama models efficiently on ordinary CPUs and GPUs. It loads models from .gguf files, a format that usually stores quantized weights, compressed enough to fit in consumer-grade memory.

For a data team, the case for local inference comes down to three things. Privacy: sensitive data never leaves your infrastructure. Cost: once the model is downloaded, inference carries no per-token charge; you pay only for the hardware and electricity you run it on. Control: you choose the model size, the quantization level, and every decoding parameter, rather than accepting whatever defaults a provider hands you.

Loading a Model and Asking It Something

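from llama_cpp import Llama

# Placeholder: point this at the .gguf file you downloaded
weights_file = "path/to/model.gguf"
llm = Llama(model_path=weights_file)  # loads the weights into memory
question = "Which programming language is most widely used for data analysis?"
response = llm(question)  # plain text completion; returns a dictionary
print(response)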

Loading the model is like opening a very large book from your hard drive and holding it in RAM. Once it is open, calling llm(question) is like asking a question to someone who has memorized that book. The model reads your text, predicts the most likely continuation token by token, and returns the full result as a dictionary. No internet connection is involved at any point.

What Comes Back

The raw response carries more than just the answer. It looks something like this:

{
  "id": "cmpl-...",
  "object": "text_completion",
  "model": "...",
  "choices": [
    {"text": "Python is the most widely used...", "index": 0, "finish_reason": "length"}
  ],
  "usage": {"prompt_tokens": 9, "completion_tokens": 16, "total_tokens": 25}
}

Think of it as a receipt from a transaction: the item you actually wanted is in there, surrounded by bookkeeping. The generated text lives at choices[0]["text"]. The choices field is a list because you could in principle request several completions at once, so [0] grabs the first and usually only one. The token counts under usage only matter when you are monitoring throughput or debugging.

answer = response["choices"][0]["text"]
print(answer)

This response shape is deliberately borrowed from the OpenAI API specification. llama-cpp-python mimics it on purpose, so code written against one is largely portable to the other.
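
As a small illustration of reading this structure, the sketch below pulls the text, the finish reason, and the token count out of a response-shaped dictionary. The sample dictionary is hand-written to mirror the shape shown above, and summarize_completion is a hypothetical helper name, not part of llama-cpp-python.

```python
def summarize_completion(response: dict) -> dict:
    """Collect the fields most callers care about from a raw completion."""
    choice = response["choices"][0]
    return {
        "text": choice["text"],
        # "length" means generation stopped because it hit the token cap
        "truncated": choice["finish_reason"] == "length",
        "total_tokens": response["usage"]["total_tokens"],
    }


# Hand-written sample that mirrors the response shape; not real model output
sample_response = {
    "id": "cmpl-...",
    "object": "text_completion",
    "model": "...",
    "choices": [
        {"text": "Python is the most widely used...", "index": 0, "finish_reason": "length"}
    ],
    "usage": {"prompt_tokens": 9, "completion_tokens": 16, "total_tokens": 25},
}

summary = summarize_completion(sample_response)
print(summary)
```

A finish_reason of "length" is worth checking for: it is the signal that the answer was cut off by the token cap rather than finished naturally.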

The Dials You Can Turn

Decoding parameters control how the model picks each next token, and they are the most important levers you will touch. At every step the model produces a probability distribution over its entire vocabulary, and these parameters reshape or trim that distribution before a token gets sampled.

temperature reshapes the whole distribution: high values flatten it so unlikely words get a real chance, low values sharpen it so the top choice dominates. top_k truncates the distribution to the K most likely tokens and samples only among those. top_p, known as nucleus sampling, keeps the smallest set of tokens whose probabilities sum to at least p, which means it adapts to the shape of the distribution rather than using a fixed count. max_tokens caps output length, and stop is a list of strings that halt generation the moment one of them appears. In practice you combine them; something like top_k=40, top_p=0.9, temperature=0.7 is a common balanced setting.
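
To see what temperature and top_k do to a distribution, here is a toy sketch in plain Python. It is an illustration of the idea only, not how llama.cpp implements its samplers; the four logits and the helper names apply_temperature and keep_top_k are made up for the example.

```python
import math


def apply_temperature(logits, temperature):
    """Softmax over logits divided by temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def keep_top_k(probs, k):
    """Zero out everything except the k most likely tokens, then renormalize."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(ranked[:k])
    kept = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(kept)
    return [p / total for p in kept]


toy_logits = [2.0, 1.0, 0.5, 0.0]  # made-up scores for a four-token vocabulary

sharp = apply_temperature(toy_logits, 0.2)  # low temperature: top token dominates
flat = apply_temperature(toy_logits, 2.0)   # high temperature: mass spreads out
top2 = keep_top_k(apply_temperature(toy_logits, 1.0), 2)

print([round(p, 3) for p in sharp])
print([round(p, 3) for p in flat])
print([round(p, 3) for p in top2])
```

The first token's share shrinks as temperature rises, and after keep_top_k only two tokens have any probability left to sample from.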

The right values depend entirely on the job. Here is a customer support reply with the creativity dial turned up:

output = llm("Can I return a product I bought last week?", temperature=0.9, max_tokens=15)

High temperature tells the model to be a little adventurous rather than always picking the most obvious next word. For support responses you want answers that feel natural and varied rather than scripted, and at 0.9 each run phrases things slightly differently, the way a person does.

Now the opposite extreme, a medical assistant where you want the textbook answer every time:

output = llm("What are the common symptoms of seasonal flu?", max_tokens=10, top_k=2)

Setting top_k=2 means the model can only choose between the two most probable tokens at each step, like forcing someone to answer from a two-option list instead of their full vocabulary. For medical information you do not want creative riffs; you want the statistically most defensible answer. Call it the trust-the-textbook setting.

And for marketing copy, nucleus sampling earns its keep:

output = llm("Write a short post announcing our summer sale", max_tokens=15, top_p=0.9)

top_p=0.9 is smarter than a fixed top_k because it adapts. When the model is very confident and one word holds 85% of the probability, the nucleus might contain just two candidates. When it is genuinely uncertain and ten words each sit near 9%, all ten stay in play. You get variety exactly where the model is unsure and focus exactly where it is sure, which is what creative but coherent copy needs.
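
The adaptive behavior is easy to check with a toy calculation. The sketch below counts how many tokens a nucleus of 0.9 would keep for two made-up distributions, one confident and one uncertain; nucleus_size is a hypothetical helper written for this illustration.

```python
def nucleus_size(probs, top_p):
    """Count how many of the most likely tokens are needed to reach top_p."""
    cumulative = 0.0
    for count, p in enumerate(sorted(probs, reverse=True), start=1):
        cumulative += p
        if cumulative >= top_p:
            return count
    return len(probs)


# Made-up distributions: one dominant token versus ten near-equal candidates
confident = [0.85, 0.06, 0.04, 0.03, 0.02]
uncertain = [0.095] * 10 + [0.01] * 5

print(nucleus_size(confident, 0.9))
print(nucleus_size(uncertain, 0.9))
```

With the same top_p, the confident case keeps only a couple of candidates while the uncertain case keeps all ten near-equal ones, which is the contrast with a fixed top_k.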

As a rough map: factual and consistent tasks like medical or legal questions want temperature between 0.1 and 0.3 with a small top_k. A balanced assistant sits around temperature 0.5 to 0.7 with top_p=0.9. Creative work pushes temperature to 0.8 and above. Strict formats like JSON extraction want temperature near zero plus the structured output machinery we will get to shortly.
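
One way to keep that map handy is a small table of presets. The values below are single picks from inside the ranges above (the top_k of 10 for factual work is an assumption standing in for "small"), and DECODING_PRESETS and decoding_kwargs are hypothetical names, so treat this as a starting point to tune rather than a recommendation.

```python
# Example values chosen from the ranges in the text; tune them for your model
DECODING_PRESETS = {
    "factual": {"temperature": 0.2, "top_k": 10},
    "balanced": {"temperature": 0.7, "top_p": 0.9},
    "creative": {"temperature": 0.9},
    "structured": {"temperature": 0.0},
}


def decoding_kwargs(task: str) -> dict:
    """Return a copy so callers can tweak values without touching the preset."""
    if task not in DECODING_PRESETS:
        raise KeyError(f"unknown task {task!r}; choose from {sorted(DECODING_PRESETS)}")
    return dict(DECODING_PRESETS[task])


settings = decoding_kwargs("factual")
print(settings)
# With a loaded model, the settings could be passed straight through:
# output = llm(prompt, max_tokens=64, **settings)
```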

Chat Completions and Roles

Beyond raw text completion, Llama supports a chat format built from messages with labeled roles. A system message sets behavior, persona, and rules, sent once at the start. user messages carry the human’s questions. assistant messages hold the model’s previous replies, which is how a conversation continues across turns.

prompt = "Give me four short steps to troubleshoot a slow wifi connection."
chat_history = [
    {"role": "user", "content": prompt}
]
result = llm.create_chat_completion(messages=chat_history, max_tokens=20)
print(result)

The chat format is a structured script with labeled speakers instead of a blank page. Even with a single user message, wrapping it in a messages list signals “you are in dialogue mode, not autocomplete mode,” and the model responds as the assistant character in that script. create_chat_completion is a genuinely different code path from calling llm(...) directly, because the library wraps your messages in the model’s chat template before running inference instead of passing your text through as-is.
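
To make the difference concrete, here is roughly what a message list turns into before the model sees it. This is a simplified approximation of the Llama 3 Instruct template written by hand; the library does this step for you, the exact template depends on the model file, and render_llama3_prompt is a hypothetical name used only here.

```python
def render_llama3_prompt(messages):
    """Approximate the Llama 3 Instruct chat template as a single string."""
    parts = ["<|begin_of_text|>"]
    for message in messages:
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content']}<|eot_id|>"
        )
    # An open assistant header is the cue for the model to write its reply
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


rendered = render_llama3_prompt(
    [{"role": "user", "content": "Give me four short steps to troubleshoot a slow wifi connection."}]
)
print(rendered)
```

Calling llm(...) with the bare question skips all of these role markers, which is why the two code paths can behave so differently on the same text.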

The system message is where the real steering happens.

chat_history = [
    {"role": "system", "content": "You are a friendly and professional support assistant for a broadband provider. If a question is not about internet service, reply exactly: 'Sorry, I can't help with that.'"},
    {"role": "user", "content": "Which cryptocurrency should I buy this year?"}
]
result = llm.create_chat_completion(messages=chat_history, max_tokens=15)
assistant_reply = result["choices"][0]["message"]["content"]
print(assistant_reply)

The system message is the backstage director whispering to the actor before the scene starts: you play a broadband support rep, stay in character no matter what the audience shouts. Because it arrives before any user turn, the model treats it as foundational context rather than a request to answer, so the guardrail generally holds even when the user asks about crypto. Smaller models can still drift, so test the refusal with a few off-topic questions.

One detail will save you a debugging session: chat responses nest one level deeper than raw completions. A raw completion puts the text at choices[0]["text"], while a chat completion puts it at choices[0]["message"]["content"]. Mixing the two up produces a KeyError that looks more mysterious than it is.
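
If the same code has to handle both kinds of response, a tiny helper can hide the difference. The sketch below is one possible approach; extract_text is a hypothetical name, and the two sample dictionaries are hand-written to mirror the two shapes.

```python
def extract_text(response: dict) -> str:
    """Return the generated text from either a raw or a chat completion."""
    choice = response["choices"][0]
    if "message" in choice:  # chat completion shape
        return choice["message"]["content"]
    return choice["text"]  # raw completion shape


# Hand-written samples that mirror the two response shapes
raw_sample = {"choices": [{"text": "Python is widely used."}]}
chat_sample = {
    "choices": [{"message": {"role": "assistant", "content": "Sorry, I can't help with that."}}]
}

print(extract_text(raw_sample))
print(extract_text(chat_sample))
```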

Prompt Engineering That Actually Moves the Needle

Clear labels help the model separate the instruction from the context from the expected output.

prompt = """
Instruction: Explain the concept of photosynthesis in simple terms.
Question: What is photosynthesis?
Answer:
"""
output = llm(prompt, max_tokens=15, stop=["Question:"])
print(output['choices'][0]['text'])

Labels like Instruction:, Question:, and Answer: act as signposts the model has seen countless times in its training data. They create a contract: here is what I want, here is the context, now fill in this blank. Leaving Answer: dangling with nothing after it is the cue for the model to start writing exactly there. The stop=["Question:"] argument is the safety net; if the model gets carried away and tries to invent a follow-up question after its answer, generation halts on the spot.
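
The effect of a stop string can be mimicked after the fact with plain string handling. The sketch below is only a model of the behavior (the real check happens during generation, inside the library); apply_stop and the runaway text are invented for the illustration.

```python
def apply_stop(text, stop):
    """Cut text at the earliest occurrence of any stop string."""
    cut = len(text)
    for marker in stop:
        position = text.find(marker)
        if position != -1:
            cut = min(cut, position)
    return text[:cut]


# Invented example of a model that keeps going after its answer
runaway = (
    " Plants use sunlight to make their own food.\n"
    "Question: What is chlorophyll?\n"
    "Answer: A green pigment."
)

trimmed = apply_stop(runaway, ["Question:"])
print(repr(trimmed))
```

Note that the stop string itself is not part of the result; everything from its first character onward is dropped.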

Few-shot prompting goes a step further: teach by example inside the prompt itself.

prompt = """Review 1: Ordered dinner here last Friday and it exceeded every expectation!
Sentiment 1: Positive,
Review 2: My delivery arrived cold and an hour late with no apology. Never again.
Sentiment 2: Negative,
Review 3: Fresh ingredients and generous portions. Will absolutely come back!
Sentiment 3: Positive,
Review 4: Wonderful meal and the staff were lovely!
Sentiment 4:"""
output = llm(prompt, max_tokens=2, stop=["Review"])
print(output['choices'][0]['text'])

The model sees three worked problems, review text paired with a label, and picks up the pattern before hitting the fourth unlabeled one. It is like showing a student three solved algebra problems and presenting the fourth unsolved. max_tokens=2 gives it just enough room to write one word and stop, and stop=["Review"] keeps it from inventing a fifth review out of habit.
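
Writing few-shot prompts by hand gets tedious once the examples change often. A minimal sketch of a builder follows, assuming examples arrive as (review, sentiment) pairs; build_few_shot_prompt is a hypothetical helper, and it leaves out the trailing commas after each label.

```python
def build_few_shot_prompt(examples, new_review):
    """Number each labeled example, then leave the last label blank."""
    lines = []
    for number, (review, sentiment) in enumerate(examples, start=1):
        lines.append(f"Review {number}: {review}")
        lines.append(f"Sentiment {number}: {sentiment}")
    next_number = len(examples) + 1
    lines.append(f"Review {next_number}: {new_review}")
    lines.append(f"Sentiment {next_number}:")  # dangling label for the model to complete
    return "\n".join(lines)


labeled_reviews = [
    ("Ordered dinner here last Friday and it exceeded every expectation!", "Positive"),
    ("My delivery arrived cold and an hour late with no apology. Never again.", "Negative"),
    ("Fresh ingredients and generous portions. Will absolutely come back!", "Positive"),
]

few_shot_prompt = build_few_shot_prompt(labeled_reviews, "Wonderful meal and the staff were lovely!")
print(few_shot_prompt)
# With a loaded model: output = llm(few_shot_prompt, max_tokens=2, stop=["Review"])
```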

When to use which? If the task is well-known and simple, just ask; zero-shot works fine for “summarize this.” If the output needs a specific format or style, show two to five examples. Otherwise start with a labeled, structured prompt and add examples only if the results wobble.

Structured Outputs: When You Need JSON, Not Prose

For machine-to-machine pipelines, a polite request for JSON is not enough. Without enforcement, the model might wrap the JSON in an explanation paragraph or produce subtly invalid syntax. The response_format argument fixes this at a deeper level than prompting.

output = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You convert stock lists from text to JSON, extracting item names and counts as keys and values in the form item: count; for example, 'notebook': 12."},
        {"role": "user", "content": "Twenty laptops, eight monitors, and three hundred forty USB cables."},
    ],
    response_format={"type": "json_object"}
)
print(output['choices'][0]['message']['content'])

This is not just another prompt instruction. Setting response_format={"type": "json_object"} engages constrained decoding at the token level: the model is physically prevented from emitting any token that would make the output syntactically invalid JSON, the way autocorrect forces a real word rather than random characters.

When you need specific keys with specific types, add a schema.

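# Illustrative messages; any chat-style message list works here
messages = [
    {"role": "system", "content": "You answer questions and reply in JSON with the keys Question and Answer."},
    {"role": "user", "content": "What is photosynthesis?"},
]

output = llm.create_chat_completion(
    messages=messages,
    response_format={
        "type": "json_object",
        "schema": {
            "type": "object",
            "properties": {
                "Question": {"type": "string"},
                "Answer": {"type": "string"}
            },
            # Standard JSON Schema keyword: both keys must be present
            "required": ["Question", "Answer"]
        }
    }
)
print(output['choices'][0]['message']['content'])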

A schema is a blueprint that says not just “output JSON” but “output JSON with exactly these named fields of exactly these types.” Without it the model might invent key names like "q" or "answer_text". With it, the constrained decoder enforces the blueprint at every single token, the same way a form with fixed labeled boxes stops you from writing in the wrong section. This is dramatically more reliable than asking nicely in the prompt.
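
Even with enforcement on the generating side, a cheap check on the receiving side is reasonable insurance, for instance against a reply cut short by a token cap. The sketch below parses a reply and confirms the two string fields from the schema above; parse_qa is a hypothetical helper and the sample strings are hand-written, not real model output.

```python
import json


def parse_qa(raw: str) -> dict:
    """Parse a JSON reply and confirm it has string Question and Answer fields."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on bad syntax
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    for key in ("Question", "Answer"):
        if not isinstance(data.get(key), str):
            raise ValueError(f"missing or non-string field: {key}")
    return data


# Hand-written sample reply for illustration
good_reply = '{"Question": "What is photosynthesis?", "Answer": "How plants turn light into food."}'
parsed = parse_qa(good_reply)
print(parsed["Answer"])
```

In a pipeline, catching ValueError around parse_qa gives one place to retry or log when a reply does not match the expected shape.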

Building a Chatbot with Memory

Here is the thing nobody tells you upfront: LLM chat APIs are stateless. The model has no memory of what you said a minute ago. Every call is like talking to someone with complete amnesia, and the only cure is to repeat the entire conversation from the start, every time. A small wrapper class makes that bearable.

class ChatSession:
    def __init__(self, llm: Llama, system_prompt='', history=None):
        self.llm = llm
        self.system_prompt = system_prompt
        history = history or []  # fresh list when no history is passed in
        self.history = [{"role": "system", "content": self.system_prompt}] + history

    def ask(self, user_prompt=''):
        # Add the new user turn, then send the whole transcript
        self.history.append({"role": "user", "content": user_prompt})
        output = self.llm.create_chat_completion(messages=self.history)
        assistant_message = output['choices'][0]['message']
        # Store the reply so the next call includes it as context
        self.history.append(assistant_message)
        return assistant_message['content']

The class keeps a running transcript in self.history, seeded with the system message. Every call to ask appends your new message, sends the full transcript to the model, then appends the model’s reply. Because the reply joins the history, the next call automatically includes it as context, which is the entire trick behind multi-turn memory.
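
The growth of the transcript can be observed without loading a model at all. In the sketch below, FakeLlama is an invented stand-in that only records how many messages it receives and echoes a canned reply; the ChatSession here is a condensed copy of the class above so the example runs on its own.

```python
class FakeLlama:
    """Stand-in for a loaded model that records what it was sent."""

    def __init__(self):
        self.message_counts = []

    def create_chat_completion(self, messages):
        self.message_counts.append(len(messages))
        reply = f"reply to: {messages[-1]['content']}"
        return {"choices": [{"message": {"role": "assistant", "content": reply}}]}


class ChatSession:
    def __init__(self, llm, system_prompt="", history=None):
        self.llm = llm
        self.history = [{"role": "system", "content": system_prompt}] + (history or [])

    def ask(self, user_prompt=""):
        self.history.append({"role": "user", "content": user_prompt})
        output = self.llm.create_chat_completion(messages=self.history)
        assistant_message = output["choices"][0]["message"]
        self.history.append(assistant_message)
        return assistant_message["content"]


fake = FakeLlama()
session = ChatSession(fake, system_prompt="You are terse.")
session.ask("first question")
session.ask("second question")

print(fake.message_counts)   # [2, 4]
print(len(session.history))  # 5
```

The first call sends two messages (system and user); the second sends four, because the earlier user turn and the assistant's reply ride along with the new question.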

A small but classic Python detail hides in the constructor. You may see versions of this class written with history=[] as the default argument. That is a well-known trap: Python evaluates default arguments once, at function definition, so every instance that stores and appends to that default list would silently share it, and conversations would bleed into each other. Defaulting to None and creating a fresh list inside the constructor avoids it.
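
The trap is easy to reproduce in isolation. The two functions below are invented for the demonstration and have nothing to do with llama-cpp-python; they differ only in how the default list is created.

```python
def remember_bad(item, seen=[]):  # one list, created once at definition time
    seen.append(item)
    return seen


def remember_good(item, seen=None):  # fresh list on every call without an argument
    if seen is None:
        seen = []
    seen.append(item)
    return seen


remember_bad("a")
leaked = remember_bad("b")    # still contains "a" from the earlier call

remember_good("a")
fresh = remember_good("b")    # starts empty each time

print(leaked)  # ['a', 'b']
print(fresh)   # ['b']
```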

Even for a single question the wrapper is handy, because the system prompt gets formatted correctly and the bookkeeping is handled for you.

instruction = "You are a travel expert that recommends a destination based on a request. Return the location name only as 'City, Country'."
travel_bot = ChatSession(llm, system_prompt=instruction)
result = travel_bot.ask("I'd like to explore ancient Roman history.")
print(result) # e.g., "Rome, Italy"

The instruction locks the output to a City, Country format, so downstream code can parse the answer without extra cleaning. The real payoff shows up across multiple turns.

travel_bot = ChatSession(
llm,
system_prompt="You are a travel expert that recommends a destination based on a request. Return the location name only as 'City, Country'."
)
first = travel_bot.ask("Recommend a French-speaking city.")
print(first) # e.g., "Paris, France"
second = travel_bot.ask("A different city in the same country")
print(second) # e.g., "Lyon, France"

That second request works because the model receives the full transcript: it sees that it previously answered “Paris, France,” infers that “same country” means France, and picks another French city. Without the history being re-sent, “same country” would be a dangling reference the model could not resolve. This is exactly how humans follow pronouns and callbacks in conversation, except here you are paying for the privilege by re-sending the transcript on every turn.

Pitfalls Worth Knowing in Advance

A few mistakes account for most of the frustration people hit with local Llama work. Forgetting max_tokens in chat mode can let the model ramble until it runs out of context, so always set a sensible cap. Using temperature=0 for everything makes outputs robotic and identical across runs; match the temperature to the use case instead. Skipping stop words in completion mode invites the model to invent follow-up questions and reviews that you never asked for. Confusing the two response shapes produces puzzling KeyError messages; remember that completions use choices[0]["text"] while chat uses choices[0]["message"]["content"]. Asking for JSON in the prompt alone occasionally returns prose or broken syntax, which is what response_format with a schema exists to prevent. And in any conversation wrapper, forgetting to append the assistant’s reply to the history gives you a chatbot with no memory of its own answers, which makes for some very confusing conversations.

Run the model, turn the dials, and keep the transcript. That is most of the craft, and all of it happens on hardware you own.

See you soon.
