Python Strings, Formatting, and Regex: A Complete Guide

Strings are how programs talk to humans. Every error message, every API response, every CSV cell, every log line is a string, which means a surprising share of everyday code is really just string wrangling. This guide covers four progressively more powerful ways to handle them: the basic methods for cleaning and slicing, the three formatting systems for producing readable output, regular expressions for matching fuzzy patterns, and the way these combine to extract structured data from messy text. Get comfortable here and a large slice of your daily code becomes effortless.

The core string methods

Python is strict about types when it comes to strings. You cannot add a string and an integer, so "reviews: " + 5 raises a TypeError, and the old fix was to cast the number first with str().

			
review = "Great phone, terrible battery."  # sample text for illustration
count = len(review)
print("Number of characters in this review: " + str(count))

This strictness is deliberate. In some languages "5" + 5 silently becomes "55" while 5 + 5 becomes 10, the same operator behaving differently by context, and Python refuses to guess. In modern code you would skip this whole dance with an f-string, f"Number of characters: {len(review)}", which needs no casting and no concatenation. Recognise the old + str(...) + pattern when you meet it in legacy code, but do not write new code that way.
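To make the contrast concrete, here is a minimal sketch that triggers the TypeError and then builds the same message both ways. The review text is a made-up sample, not data from any real source.

```python
review = "Great phone, terrible battery."  # hypothetical sample text

# Mixing str and int with + raises TypeError; Python does not guess.
try:
    message = "reviews: " + 5
except TypeError as exc:
    message = f"TypeError: {exc}"

# Legacy style: cast explicitly, then concatenate.
legacy = "Number of characters: " + str(len(review))

# Modern style: the f-string converts the value for you.
modern = f"Number of characters: {len(review)}"

print(message)
print(legacy == modern)  # both styles produce the same string
```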

Slicing is Python’s universal “give me a sub-piece” tool, and the same syntax works on strings, lists, and any sequence. The one gotcha worth repeating is that the stop index is exclusive, so review[0:5] returns five characters, indices zero through four. Because strings are immutable, a slice always returns a new string and leaves the original untouched. A few idioms are worth typing from memory: s[:n] for the first n characters, s[n:] from n to the end, s[-n:] for the last n, and the famous s[::-1], which walks the string backwards with a step of minus one to reverse it. That reverse trick gives you a clean palindrome check: reverse the string and compare it to the original, and if they are equal it reads the same both ways.
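The idioms above fit in a few lines. This is a small sketch using an arbitrary sample word; the is_palindrome helper is a name invented here for illustration.

```python
s = "stressed"  # arbitrary sample word

first_five = s[0:5]    # 'stres': indices 0 through 4, stop is exclusive
first_three = s[:3]    # 'str'
from_three = s[3:]     # 'essed'
last_three = s[-3:]    # 'sed'
reversed_s = s[::-1]   # 'desserts'


def is_palindrome(text: str) -> bool:
    """Return True if text reads the same in both directions."""
    return text == text[::-1]


print(first_five, first_three, from_three, last_three, reversed_s)
print(is_palindrome("level"), is_palindrome(s))
print(s)  # the original is untouched, because slices return new strings
```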

Normalising text is the next layer, and three methods do most of it. .lower() case-folds to lowercase, .strip() removes characters from both ends, and .split() breaks a string into a list of tokens on whitespace.

clean = review.lower().strip("$").split()

There is a real trap hiding in .strip(), though. Its argument is a set of characters to trim, not a literal substring. So "hello<\\i>i".rstrip("<\\i>") does not remove the literal tag; it strips any of the characters <, \, i, and > one at a time from the right until it meets a character outside that set, which here eats the trailing i as well and leaves just "hello". When you genuinely want to remove a literal suffix, use .removesuffix("<\\i>") (available from Python 3.9) or .replace("<\\i>", "") instead.

The split-transform-rejoin pattern pairs .split() with .join() and is the workhorse for reshaping delimited text. Its sibling .splitlines() is smarter than splitting on a newline, because it handles Unix, Windows, and old Mac line endings all at once, which makes it the right tool for reading cross-platform text files.
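A short sketch of the trap and the safer alternatives follows. The tagged string and the comma-separated row are invented samples, and .removesuffix() assumes Python 3.9 or later.

```python
tagged = "<i>hi</i>"  # hypothetical sample

# rstrip treats its argument as a set of characters, so it keeps
# trimming while the last character is any of <, /, i, or >.
over_trimmed = tagged.rstrip("</i>")    # '<i>h', the i of "hi" is gone too

# removesuffix removes the literal suffix, once, and nothing else.
exact = tagged.removesuffix("</i>")     # '<i>hi'

# Split-transform-rejoin: break on commas, tidy each piece, join again.
csv_row = "alice, bob ,carol"
tidy = ";".join(name.strip().title() for name in csv_row.split(","))

# splitlines copes with \n, \r\n, and \r in the same text.
lines = "unix\nwindows\r\nold mac\rend".splitlines()

print(over_trimmed, exact, tidy, lines)
```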

Searching strings comes down to three methods answering three questions. .find(sub) answers “where is it?” returning an index or minus one if absent, .count(sub) answers “how many times?” and .replace(old, new) answers “swap all of these for those.” A close relative of .find() is .index(), and the difference matters: .find() is forgiving and returns minus one when the substring is missing, while .index() is strict and raises a ValueError, so you use .index() when a missing substring is genuinely an error and .find() when you will handle the not-found case yourself. Two cautions on .replace(): it returns a new string so you must capture the result, and it is not word-aware, so replacing “is” with “X” turns “isn’t” into “Xn’t”. When you need word boundaries or context, that is exactly what regex is for.
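Here is a small sketch of all four methods side by side, using made-up sentences. It shows the not-found behaviour of each and the word-blind replacement.

```python
text = "this is what it is"  # hypothetical sample

position = text.find("is")       # 2, the "is" inside "this"
missing = text.find("zebra")     # -1, no exception
occurrences = text.count("is")   # 3: inside "this", then "is" twice

# .index() raises instead of returning -1.
try:
    text.index("zebra")
    index_failed = False
except ValueError:
    index_failed = True

# .replace() returns a new string and ignores word boundaries.
swapped = "it isn't his".replace("is", "X")   # "it Xn't hX"

print(position, missing, occurrences, index_failed, swapped)
```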

Three ways to format

Python has three formatting systems, and knowing when to use each is the point. The oldest of the three is the .format() method, with {} placeholders filled by the arguments in order, or {0} and {1} for explicit positions that let you reorder or repeat. It can reach into a dictionary inside a placeholder, with the quirk that the key takes no quotes, so it is {data[field]} not {data['field']}. And after a colon it accepts a format spec, including strftime codes for dates.

			
from datetime import datetime
now = datetime.now()
print("Today is {today:%B %d, %Y}, and it is {today:%H:%M}".format(today=now))
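The other .format() features mentioned above can be sketched the same way. The names and values here are placeholders chosen for illustration.

```python
data = {"field": "reviews"}  # hypothetical dictionary

# Placeholders filled in order.
in_order = "{} scored {}".format("Ada", 9)

# Explicit positions allow reordering and repetition.
reordered = "{1} beats {0}, but {0} tries again".format("rock", "paper")

# Dictionary lookup inside a placeholder: the key takes no quotes.
from_dict = "Column: {data[field]}".format(data=data)

# A template can be defined now and filled later.
template = "{name} has {count:,} rows"
filled = template.format(name="sales", count=1234567)

print(in_order, reordered, from_dict, filled, sep="\n")
```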

The modern, recommended choice is the f-string, introduced in Python 3.6, which evaluates its placeholders right where they are written.

			
field, fraction, analysed = "Text", 80.0, 0.5  # sample values for illustration
print(f"{field} makes up {fraction:.2f}% of the data but only {analysed:.1f}% is analysed")

The power of f-strings is that any Python expression can go inside the braces, arithmetic, method calls, indexing, even a conditional, so you can divide and format in one place with something like f"{count/minutes:.1f} per minute". The format spec after the colon is the same machinery as .format(), giving you :.2f for two decimals, :,.0f for thousands separators, :.1% for percentages, and :>10 or :^10 for alignment. The one thing f-strings cannot do is be defined now and filled later, because they evaluate immediately, which is occasionally a reason to keep a .format() template around.
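Each of the format specs above is easiest to absorb by seeing its output. This sketch uses invented numbers throughout.

```python
count, minutes = 45, 7      # hypothetical sample values
revenue = 1234567.891
share = 0.1234

rate = f"{count / minutes:.1f} per minute"         # '6.4 per minute'
money = f"{revenue:,.0f}"                           # '1,234,568'
percent = f"{share:.1%}"                            # '12.3%'
right = f"{'id':>10}"                               # right-aligned in 10 columns
centre = f"{'id':^10}"                              # centred in 10 columns
label = f"{'many' if count > 10 else 'few'}"        # conditional expression

print(rate, money, percent, repr(right), repr(centre), label, sep="\n")
```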

The third system, Template from the string module, is the least known and exists for one reason: safety. It is deliberately limited to simple $identifier placeholders with no expressions, no method calls, and no attribute access.

			
from string import Template
tool_name, tool_desc = "regex", "a pattern-matching language"  # sample values
msg = Template("$tool is $description")
print(msg.substitute(tool=tool_name, description=tool_desc))

That limitation is the feature. When a format string itself comes from outside your trusted code (a config file, a database, a user), .format() can be exploited to reach into Python’s internals, because its placeholders allow attribute and index access on whatever objects you pass in, and an f-string cannot safely be built from external text at all, since that would mean running it through eval. A Template can only do name substitution and nothing more. The syntax adds ${identifier} braces for when a placeholder sits next to letters, as in ${pay}ly, and $$ for a literal dollar sign. There are two ways to fill it: .substitute() raises a KeyError if any placeholder is missing, while .safe_substitute() leaves missing placeholders untouched rather than failing, which suits partial data. Use Template for user-supplied formats, and f-strings everywhere else.
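As a rough illustration of both the syntax and the safety argument, the sketch below fills one template two ways and then feeds the same attribute-probing text to .format() and to Template. All strings and names are invented for the example.

```python
from string import Template

msg = Template("$tool costs $$5, ${pay}ly")

# substitute needs every placeholder; $$ becomes a literal dollar sign.
full = msg.substitute(tool="Widget", pay="month")   # 'Widget costs $5, monthly'

# safe_substitute leaves what it cannot fill exactly as written.
partial = msg.safe_substitute(tool="Widget")        # 'Widget costs $5, ${pay}ly'

try:
    msg.substitute(tool="Widget")
    raised = False
except KeyError:
    raised = True

# An untrusted .format() string can walk attributes of what you pass in...
leaked = "{user.__class__.__name__}".format(user="bob")          # 'str'

# ...whereas Template only substitutes the name and keeps the rest as text.
inert = Template("$user.__class__").substitute(user="bob")       # 'bob.__class__'

print(full, partial, raised, leaked, inert, sep="\n")
```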

Regular expressions

Where string methods match literal text, regex matches patterns: any digit, one or more letters, anything followed by a period. It is the difference between asking whether the word “cat” appears and asking whether any three-letter word appears. Two things matter before anything else. First, always write patterns as raw strings with the r"..." prefix, because without it Python processes backslash escapes before the regex engine ever sees them: \b (a word boundary in regex) silently turns into a backspace character, and \d only survives with an invalid-escape warning on recent versions. Forgetting the r is the single most common beginner bug. Second, learn the core metacharacters: \d for a digit, \w for a word character, \s for whitespace, . for any character, with the uppercase \D, \W, and \S as their negations, plus ^ and $ for the start and end of the string.

			
import re
post = "Thanks @robot9! and @robot4& but not @robot77!"  # sample text for illustration
print(re.findall(r"@robot\d\W", post))

That pattern matches a literal @robot, then any digit, then any non-word character, and re.findall returns every substring that matches. Quantifiers control how many times the previous element repeats: * for zero or more, + for one or more, ? for optional, {n} for exactly n, and {n,m} for a range. So \w+ is the standard “any word”, \d{4} is exactly four digits for a year, and http\S+ grabs a URL as “http” followed by one or more non-space characters. Character classes in square brackets match one character from a set, so [aeiou] is any vowel, [a-z] any lowercase letter, and [^abc] anything except a, b, or c, giving you precise control beyond the built-in shorthands.
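The quantifiers and character classes above can be tried on a single sample sentence. The post text and URL below are made up for illustration.

```python
import re

post = "In 2021 I posted https://example.com/a?b=1 and got 15 likes"  # hypothetical

words = re.findall(r"\w+", post)          # runs of word characters
years = re.findall(r"\d{4}", post)        # exactly four digits
urls = re.findall(r"http\S+", post)       # "http" plus non-space characters
vowels = re.findall(r"[aeiou]", "regex")  # one character from a set
not_abc = re.findall(r"[^abc]", "abcde")  # anything except a, b, or c

print(words[:3], years, urls, vowels, not_abc, sep="\n")
```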

A family of functions runs these patterns, and choosing the right one matters. re.findall returns all matches, re.search finds the first match anywhere, re.match only checks the start of the string, and re.fullmatch requires the entire string to match. This distinction trips people up in validation: a naive email check with re.match will pass “alice@example.com and some junk” because only the start matched, so strict validation wants re.fullmatch. There is also re.sub for pattern-based replacement, which is .replace() for patterns and can take a function as its replacement, and re.split for splitting on a pattern, where re.split(r"\s+", text) robustly handles runs of inconsistent whitespace.
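The differences between these functions show up clearly on one input. In this sketch the EMAIL pattern is a deliberately simplistic stand-in, not a real email validator, and the sample strings are invented.

```python
import re

EMAIL = r"[\w.+-]+@[\w-]+\.[\w.]+"   # simplistic pattern, for illustration only
candidate = "alice@example.com and some junk"

loose = re.match(EMAIL, candidate) is not None        # True: the start matches
strict = re.fullmatch(EMAIL, candidate) is not None   # False: junk at the end
strict_ok = re.fullmatch(EMAIL, "alice@example.com") is not None

# search finds the first match anywhere in the string.
first_number = re.search(r"\d+", "order 66 of 99").group()

# sub can take a function that receives each match object.
doubled = re.sub(r"\d+", lambda m: str(int(m.group()) * 2), "3 cats, 10 dogs")

# split on a pattern handles mixed runs of whitespace.
tokens = re.split(r"\s+", "one  two\t three\nfour")

print(loose, strict, strict_ok, first_number, doubled, tokens, sep="\n")
```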

Parentheses do double duty in regex, grouping for quantifiers and alternation, and capturing the matched text for extraction. This is the killer feature for pulling structured data out of semi-structured text.

			
flight = "Subject: flight BA1234 LHR-JFK 12MAR"  # sample text for illustration
regex = r"([A-Z]{2})(\d{4})\s([A-Z]{3})-([A-Z]{3})\s(\d{2}[A-Z]{3})"
match = re.findall(regex, flight)

Each match here is a tuple of five captured pieces, an airline code, flight number, departure, destination, and date, which you can index out individually. A subtle behaviour to know is how capturing changes what re.findall returns: with no groups it returns the full matches, with one group it returns just that group, and with several groups it returns tuples, which surprises people who added a group only to apply a quantifier. The fix is a non-capturing group, (?:...), which groups without returning, and is also where alternation with | lives, as in (?:love|like|enjoy). For readability, named groups written (?P<day>\d{2}) let you pull a value out by name rather than a numeric index. And backreferences let a pattern refer to what it already captured, so <(\w+)>.*?</\1> matches an HTML tag only when the closing tag name matches the opening one, and (\w)\1 catches a doubled letter for spotting elongated words like “soooo”.
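The sketch below walks through each of those behaviours in turn: how groups change what findall returns, a non-capturing alternation, named groups, and backreferences. All sample strings are invented.

```python
import re

dates = "10-12 and 03-04"  # hypothetical sample

full = re.findall(r"\d{2}-\d{2}", dates)            # no groups: full matches
one_group = re.findall(r"(\d{2})-\d{2}", dates)     # one group: just that group
two_groups = re.findall(r"(\d{2})-(\d{2})", dates)  # several groups: tuples

# Non-capturing alternation keeps findall returning the whole match.
text = "I love tea, I like coffee"
phrases = re.findall(r"I (?:love|like|enjoy) \w+", text)

# Named groups are read back by name instead of position.
m = re.search(r"(?P<day>\d{2})(?P<month>[A-Z]{3})", "BA1234 LHR-JFK 12MAR")
day, month = m.group("day"), m.group("month")

# Backreferences: the closing tag must repeat the captured opening name.
tags = re.findall(r"<(\w+)>.*?</\1>", "<b>bold</b> <i>mixed</b>")
elongated = re.search(r"(\w)\1", "soooo") is not None

print(full, one_group, two_groups, phrases, day, month, tags, elongated, sep="\n")
```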

The concept that causes the most confusion is greedy versus lazy matching. Quantifiers are greedy by default and match as much as possible, while adding a ? makes them lazy and match as little as possible. Stripping HTML tags is the classic example.

clean = re.sub(r"<.+?>", "", text)

Against “see that <strong>amazing show</strong> again”, the greedy <.+> matches from the first < to the last >, swallowing the content as one huge match, whereas the lazy <.+?> matches each tag separately, which is what you want. The rule of thumb is that when you match delimited content (parentheses, tags, quotes) you almost always go lazy. Remember also to escape literal metacharacters, since \( is a literal parenthesis while ( opens a group. For genuine numbers and whole words, greedy is correct, because [0-9]+ should grab “24” as a whole rather than two single digits.
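Running both versions on the sentence above makes the difference visible. The parenthesis and number strings are extra samples added here for illustration.

```python
import re

text = "see that <strong>amazing show</strong> again"

greedy = re.findall(r"<.+>", text)    # one match spanning both tags
lazy = re.findall(r"<.+?>", text)     # each tag matched separately
clean = re.sub(r"<.+?>", "", text)    # tags stripped, content kept

# Escaped parentheses are literal; the unescaped pair captures lazily.
in_parens = re.findall(r"\((.+?)\)", "f(a) + g(b)")

# Greedy is right for whole numbers.
numbers = re.findall(r"[0-9]+", "24 hours, 7 days")

print(greedy, lazy, clean, in_parens, numbers, sep="\n")
```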

The most advanced feature is lookarounds, which assert a condition about the surroundings without consuming the characters. There are four: positive and negative lookahead, (?=x) and (?!x), and positive and negative lookbehind, (?<=x) and (?<!x).

following = re.findall(r"(?<=[Pp]ython\s)\w+", post)

This finds the word after “Python ” without including “Python” itself, because lookarounds are zero-width. One real limitation to know in Python’s standard re module is that lookbehinds must be fixed width, so you cannot write (?<=\w+), and if you need variable-width lookbehind you reach for the third-party regex package.
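All four lookarounds, plus the fixed-width restriction, can be seen on one made-up post. This is a sketch against the standard re module only; the third-party regex package is not used here.

```python
import re

post = "I use Python daily and python3 sometimes, price $40 not 15%"  # hypothetical

following = re.findall(r"(?<=[Pp]ython\s)\w+", post)   # positive lookbehind
dollars = re.findall(r"(?<=\$)\d+", post)              # digits right after a $
percents = re.findall(r"\d+(?=%)", post)               # positive lookahead
not_percent = re.findall(r"\b\d+\b(?!%)", post)        # negative lookahead
not_dollar = re.findall(r"(?<!\$)\b\d+", post)         # negative lookbehind

# The standard re module rejects variable-width lookbehind at compile time.
try:
    re.compile(r"(?<=\w+)x")
    variable_width_ok = True
except re.error:
    variable_width_ok = False

print(following, dollars, percents, not_percent, not_dollar, variable_width_ok, sep="\n")
```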

Choosing the right tool

Stepping back, the decisions are simple once the pieces are clear. For formatting, default to f-strings for your own code, keep .format() for templates you fill later, and use Template whenever the format string comes from an untrusted source. For matching, pick the function by intent: findall for all matches, search for the first, match for the start, fullmatch for the whole string, sub to replace, and split to tokenise. Use capturing groups to extract parts, non-capturing groups when you only need grouping, and backreferences when the pattern must refer to itself. When a pattern matches too much, make the quantifier lazy, and when you need to check the context without consuming it, reach for a lookaround. And whatever you do, build patterns up gradually and lean on a tester like regex101 rather than guessing, because reading a complex pattern left to right, one piece at a time, is the real skill that makes regex approachable.

See you soon.
