Most datasets are full of columns that repeat the same handful of values over and over: country, plan tier, job title, survey response. Treating those as plain text is wasteful and limiting. Pandas has a dedicated categorical dtype that fixes both problems at once, and it pays off three ways. It saves memory, often dramatically. It lets you impose a meaningful order, so Bronze ranks below Silver below Gold rather than sorting alphabetically. And it makes group-by operations, encoding for machine learning, and plotting cleaner and more correct.
The mechanism is simple. A categorical column stores each unique value once in a lookup table, then each row holds a small integer code pointing back to it. For a column with lots of repetition, that can be ten to a hundred times smaller than storing the strings outright. This article walks the whole workflow: creating and converting categoricals, the .cat accessor, cleaning messy categories, grouping, encoding for models, and visualizing with seaborn.
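To make the lookup-table idea concrete, here is a minimal sketch on made-up data (the plan tiers and row count are invented for illustration). It shows the category table, the per-row integer codes, and a memory comparison; the exact savings will vary with your data and pandas version.

```python
import pandas as pd

# Made-up data: a highly repetitive column of plan tiers.
plans = pd.Series(["free", "pro", "free", "team", "pro", "free"] * 10_000)
as_category = plans.astype("category")

# The lookup table holds each distinct value exactly once...
print(list(as_category.cat.categories))         # ['free', 'pro', 'team']
# ...and every row stores only a small integer code pointing into it.
print(as_category.cat.codes.head(6).tolist())   # [0, 1, 0, 2, 1, 0]

# deep=True counts the string payloads, not just the references to them.
text_bytes = plans.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
print(f"text: {text_bytes:,} bytes, category: {category_bytes:,} bytes")
```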
Creating and Converting Categoricals
The most common move is converting an existing column with astype("category"), but you can also declare the type at load time, which skips the intermediate string representation entirely and saves both memory and parse work.
    import pandas as pd

    employees["job_title"] = employees["job_title"].astype("category")

    employees = pd.read_csv(
        "staff.csv",
        dtype={"department": "category", "education": "category"},
    )

    ranks = pd.Categorical(
        medals_won,
        categories=["Bronze", "Silver", "Gold"],
        ordered=True,
    )
The pd.Categorical constructor is the one to reach for when order matters. Passing ordered=True along with the category list turns the column into a true ranking, so comparisons like Bronze < Silver work and sorts come out in the natural sequence rather than alphabetically. Without it, the categories are just unordered labels. You can confirm the memory win directly by comparing .memory_usage(deep=True) for an object Series against the same data as a category. The deep=True part matters: .nbytes on an object column counts only the 8-byte references, not the strings they point to, so it understates the saving. For repetitive columns the difference is stark.
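As a small illustration of what ordered=True buys you, the sketch below uses an invented medals_won list. Sorting, comparisons, and min/max all follow the declared ranking.

```python
import pandas as pd

# Hypothetical medal results; the category list defines the ranking.
medals_won = ["Silver", "Gold", "Bronze", "Silver", "Bronze"]
ranks = pd.Series(
    pd.Categorical(
        medals_won,
        categories=["Bronze", "Silver", "Gold"],
        ordered=True,
    )
)

# Sorting follows the declared order, not the alphabet.
print(ranks.sort_values().tolist())
# ['Bronze', 'Bronze', 'Silver', 'Silver', 'Gold']

# Comparisons against a category label are element-wise and rank-aware.
at_least_silver = ranks >= "Silver"
print(at_least_silver.tolist())  # [True, True, False, True, False]

# min and max are available because the categories are ordered.
print(ranks.min(), ranks.max())  # Bronze Gold
```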
The .cat Accessor
Just as .str is the namespace for string operations and .dt for datetime work, .cat is the namespace for everything that touches the category structure itself rather than the row values.
    df["size"].cat.categories  # current categories, in order
    df["size"].cat.codes       # integer code per row, -1 for NaN

    df["size"] = df["size"].cat.add_categories(["extra_large"])
    df["size"] = df["size"].cat.remove_categories(["unwanted"])
    df["size"] = df["size"].cat.rename_categories({"sm": "small"})
    df["size"] = df["size"].cat.reorder_categories(
        ["small", "medium", "large"], ordered=True
    )
.categories shows the allowed set and .codes shows the underlying integers. The modifying methods all change which values are permitted: add_categories extends the set without altering existing rows, remove_categories drops values (and crucially turns any rows still using them into NaN, so be careful), and rename_categories relabels by dict or transforms by a callable. reorder_categories changes the order, which in turn changes how the column sorts and compares; it expects exactly the existing categories, just rearranged. The pattern is always the same: column, then .cat, then a category-level operation. Reassign the result rather than reaching for the inplace argument, which was deprecated and then removed from these methods in pandas 2.0.
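The NaN side effect of remove_categories is easy to miss, so here is a short sketch on an invented sizes Series. It also shows rename_categories with both a dict and a callable.

```python
import pandas as pd

# Invented data; "unwanted" is still used by one row.
sizes = pd.Series(["sm", "medium", "large", "unwanted", "sm"], dtype="category")

# Removing a category that rows still use turns those rows into NaN.
pruned = sizes.cat.remove_categories(["unwanted"])
print(pruned.isna().tolist())     # [False, False, False, True, False]
print(pruned.cat.codes.tolist())  # [2, 1, 0, -1, 2]  (NaN is code -1)

# A dict relabels only the listed categories...
renamed = pruned.cat.rename_categories({"sm": "small"})
# ...while a callable is applied to every label.
shouted = renamed.cat.rename_categories(str.upper)
print(list(shouted.cat.categories))  # ['LARGE', 'MEDIUM', 'SMALL']
```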
Cleaning Messy Categories
Real categorical data arrives full of typos, inconsistent capitalization, and stray whitespace, where “ Male”, “male”, and “MALE” are three different strings to the computer. The reliable pipeline handles these in order and converts to categorical only at the end.
    respondents["sex"] = respondents["sex"].replace({"Malez": "male"})
    respondents["sex"] = respondents["sex"].str.strip()
    respondents["sex"] = respondents["sex"].str.lower()
    respondents["sex"] = respondents["sex"].astype("category")
Fix known typos with replace, strip whitespace, standardize case so the variants collapse into one value, and only then declare the type. Converting first would force you to manage the category set midway through cleanup, which is needless friction. Clean the strings, then commit the type.
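To see the variants actually collapse, here is the same pipeline as a single chain on a tiny invented frame (the sample answers are made up for illustration).

```python
import pandas as pd

# Made-up survey answers with the usual mess: typo, padding, mixed case.
respondents = pd.DataFrame(
    {"sex": [" Male", "male", "MALE", "Malez", "female ", "Female"]}
)
distinct_before = respondents["sex"].nunique()
print(distinct_before)  # 6 distinct strings before cleaning

cleaned = (
    respondents["sex"]
    .replace({"Malez": "male"})  # fix known typos
    .str.strip()                 # drop stray whitespace
    .str.lower()                 # standardize case
    .astype("category")          # commit the type last
)
print(list(cleaned.cat.categories))  # ['female', 'male']
```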
The same replace tool collapses fine-grained categories into broader buckets, which matters before modeling because high-cardinality columns explode into hundreds of one-hot columns and rare values carry little signal.
    collapse_coat = {"wirehaired": "medium", "medium-long": "medium"}
    pets["coat_grouped"] = pets["coat"].replace(collapse_coat).astype("category")
Two cleanup situations need a specific two-step touch. To delete a category that some rows still use, first reassign those rows to another value, then prune the now-unused category, because removing it while rows depend on it would silently turn them into NaN.
    pets.loc[pets["likes_children"] == "maybe", "likes_children"] = "no"
    pets["likes_children"] = pets["likes_children"].cat.remove_categories(["maybe"])
And because NaN is not itself a category, handle missing values before converting, assigning them a meaningful default so downstream group-bys, plots, and models treat them as a real group rather than skipping them.
    pets.loc[pets["body_type"].isna(), "body_type"] = "other"
    pets["body_type"] = pets["body_type"].astype("category")
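A quick sketch of why the order matters, using an invented body_type column. Converted as-is, missing rows get code -1 and vanish from counts; filled first, they become a visible group. fillna is used here as a compact equivalent of the .loc assignment.

```python
import numpy as np
import pandas as pd

# Invented data with two missing values.
pets = pd.DataFrame({"body_type": ["slim", np.nan, "stocky", "slim", np.nan]})

# Converted as-is, missing rows get code -1 and no category of their own.
raw = pets["body_type"].astype("category")
print(raw.cat.codes.tolist())        # [0, -1, 1, 0, -1]
print(raw.value_counts().to_dict())  # {'slim': 2, 'stocky': 1}

# Filling first makes "other" a real group that counts can see.
filled = pets["body_type"].fillna("other").astype("category")
print(filled.value_counts().to_dict())
```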
GroupBy with Ordered Categories
GroupBy works on categorical columns like any other, with one valuable bonus: when the categories have an explicit order, the output respects it instead of sorting alphabetically.
    employees.groupby("sex")["hours_per_week"].mean()
    employees.groupby(["sex", "income_bracket"]).size()

    dimensions = ["education", "income_bracket"]
    employees.groupby(dimensions)["hours_per_week"].mean()

    pets["size"] = pets["size"].cat.reorder_categories(
        ["small", "medium", "large"], ordered=True
    )
    pets.groupby("size")["sex"].value_counts()
After reordering to small, medium, large, the grouped results appear in that natural sequence rather than large, medium, small. Notice too that you can pass groupby a variable holding a list of column names, which lets a notebook or a user configure the grouping dimensions without rewriting the code.
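The difference in output order is easiest to see side by side. This sketch uses an invented pets frame with a hypothetical weight_kg column. It passes observed=True explicitly, which is my choice here because the default for categorical groupers has been changing across pandas versions.

```python
import pandas as pd

# Invented data; weight_kg is a hypothetical column for illustration.
pets = pd.DataFrame({
    "size": ["large", "small", "medium", "small", "large", "medium"],
    "weight_kg": [30.0, 4.0, 12.0, 6.0, 34.0, 14.0],
})

# Plain strings group in alphabetical order.
alphabetical = pets.groupby("size")["weight_kg"].mean()
print(alphabetical.index.tolist())  # ['large', 'medium', 'small']

# With an ordered categorical, the output follows the declared order.
pets["size"] = pd.Categorical(
    pets["size"], categories=["small", "medium", "large"], ordered=True
)
by_size = pets.groupby("size", observed=True)["weight_kg"].mean()
print(by_size.index.tolist())  # ['small', 'medium', 'large']
print(by_size.tolist())        # [5.0, 13.0, 32.0]
```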
Encoding for Machine Learning
Most algorithms cannot take string categories directly, so you encode them, and the right encoding depends on the model and the data.
Label encoding assigns each category an integer, which pandas gives you for free through .cat.codes. Always save the mapping so you can translate predictions back into readable labels later.
    cars["color"] = cars["color"].astype("category")
    cars["color_code"] = cars["color"].cat.codes
    color_map = dict(zip(cars["color"].cat.codes, cars["color"]))
The catch is that label encoding implies an ordering. That is usually tolerable for tree-based models like random forests and XGBoost, which split on thresholds and so depend only on the order of the codes, not on the distances between them. It misleads linear models, though, which would read color codes 0, 1, 2 as “the third color is greater than the second.” When you need to decode model output, .map is the inverse operation: apply it to the codes with the saved mapping, and each code is replaced by its original label.
    cars["color_decoded"] = cars["color_code"].map(color_map)
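A full round trip from labels to codes and back can be sketched as follows. The colors and the predicted_codes Series are invented, and building the mapping with enumerate over .cat.categories is one alternative to the zip approach; both give a code-to-label dict.

```python
import pandas as pd

# Invented data for illustration.
cars = pd.DataFrame({"color": ["red", "blue", "green", "blue", "red"]})
cars["color"] = cars["color"].astype("category")
cars["color_code"] = cars["color"].cat.codes

# Code -> label mapping built straight from the category table.
color_map = dict(enumerate(cars["color"].cat.categories))
print(color_map)  # {0: 'blue', 1: 'green', 2: 'red'}

# Pretend these codes came back from a model, then decode them.
predicted_codes = pd.Series([2, 0, 1])
decoded = predicted_codes.map(color_map)
print(decoded.tolist())  # ['red', 'blue', 'green']
```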
When you only care whether a value matches some criterion, boolean encoding produces a clean 1/0 flag. np.where takes a condition and two values and picks element-wise between them.
    import numpy as np

    cars["is_volkswagen"] = np.where(
        cars["manufacturer"].str.contains("Volkswagen", regex=False), 1, 0
    )
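Two details are worth checking on real data: missing values and case. The sketch below, on invented manufacturer strings, adds na=False so that missing entries count as "no match". That is an addition of mine rather than part of the snippet above.

```python
import numpy as np
import pandas as pd

# Invented data, including a missing value and a lowercase variant.
cars = pd.DataFrame(
    {"manufacturer": ["Volkswagen AG", "Toyota", None, "volkswagen"]}
)

# na=False treats missing manufacturers as non-matches.
# The match is case-sensitive by default, so "volkswagen" does not count.
mask = cars["manufacturer"].str.contains("Volkswagen", regex=False, na=False)
cars["is_volkswagen"] = np.where(mask, 1, 0)
print(cars["is_volkswagen"].tolist())  # [1, 0, 0, 0]
```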
And when categories are unordered and you are feeding a linear model or neural network, one-hot encoding is usually what you want, turning one column into one binary column per category so no false ordering is imposed.
    cars_encoded = pd.get_dummies(
        cars,
        columns=["manufacturer", "transmission"],
        prefix="dummy",
    )
The danger with one-hot encoding is cardinality: a column with thousands of unique values becomes thousands of columns, which is exactly why collapsing rare categories first pays off. The short rule is label encoding for ordered categories and tree models, boolean for simple yes/no flags, and one-hot for unordered categories with linear models.
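To make the cardinality point tangible, this sketch one-hot encodes a tiny invented cars frame and checks that the column count grows by one per distinct category. It uses the default column-name prefixes and dtype=int for readable 0/1 output, both choices of mine for the illustration.

```python
import pandas as pd

# Invented data; price is a hypothetical numeric column left untouched.
cars = pd.DataFrame({
    "manufacturer": ["Ford", "Kia", "Ford"],
    "transmission": ["manual", "auto", "auto"],
    "price": [10, 12, 9],
})

encoded = pd.get_dummies(
    cars, columns=["manufacturer", "transmission"], dtype=int
)
print(encoded.columns.tolist())
# ['price', 'manufacturer_Ford', 'manufacturer_Kia',
#  'transmission_auto', 'transmission_manual']

# One new column per distinct value: this is what explodes at high cardinality.
expected_width = 1 + cars["manufacturer"].nunique() + cars["transmission"].nunique()
print(encoded.shape[1], expected_width)  # 5 5
```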
Seaborn Categorical Plots
sns.catplot is the Swiss Army knife of categorical visualization: change the kind argument and the same call produces a different chart type.
    import seaborn as sns
    import matplotlib.pyplot as plt

    sns.set(font_scale=1.4)
    sns.set_style("whitegrid")

    sns.catplot(x="traveler_type", y="helpful_votes", data=reviews, kind="box")
    sns.catplot(x="stay_period", y="helpful_votes", data=reviews, kind="bar")
    sns.catplot(x="venue", y="score", data=reviews, kind="bar", hue="free_internet")
    sns.catplot(x="review_weekday", col="stay_period", kind="count",
                col_wrap=2, data=reviews)
    plt.show()
Use kind="box" for distribution comparisons, kind="bar" for means with confidence intervals, kind="point" for compact mean-and-error dots (where dodge=True separates overlapping groups), and kind="count" for raw frequencies, which needs no y. Add hue to split by a second category using color, and add col to produce a grid of subplots, one per category value, where col_wrap starts a new row after a set number of charts so you avoid a single endless row. The styling calls only need to run once per session; they set defaults for everything that follows.
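For a version you can run without the reviews dataset, here is a sketch on a tiny invented frame. It assumes seaborn and matplotlib are installed and selects the non-interactive Agg backend so it works without a display; in a notebook you would drop that line and call plt.show() as above.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display

import pandas as pd
import seaborn as sns

# Invented stand-in for the reviews data.
reviews = pd.DataFrame({
    "traveler_type": ["solo", "family", "solo", "couple", "family", "couple"],
    "stay_period": ["summer", "summer", "winter", "winter", "spring", "spring"],
    "helpful_votes": [3, 8, 1, 5, 7, 2],
})

# Same function, different kind: a box plot of votes per traveler type...
g_box = sns.catplot(data=reviews, x="traveler_type", y="helpful_votes", kind="box")

# ...and raw counts, one subplot per stay period, wrapping after two columns.
g_count = sns.catplot(
    data=reviews, x="traveler_type", col="stay_period", kind="count", col_wrap=2
)

# Three stay periods give three panels: two on the first row, one on the second.
n_panels = len(g_count.axes.flat)
print(n_panels)  # 3
```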
The default output is rarely presentation-ready, so capture the returned grid and polish it.
    g = sns.catplot(x="free_internet", y="score", hue="traveler_type",
                    kind="bar", data=reviews,
                    palette=sns.color_palette("Set2"))
    g.fig.suptitle("Average Rating by Internet Access")
    g.set_axis_labels("Free Internet", "Avg Rating")
    plt.subplots_adjust(top=0.93)
    plt.show()
fig.suptitle adds a title above the whole figure (the .fig because the title sits at the figure level, not inside a single subplot), set_axis_labels names both axes in one call, and plt.subplots_adjust(top=0.93) fixes the common annoyance of the title overlapping the top subplot by nudging the plot area down. The palette argument controls the color scheme, with “Set2” a friendly choice for categorical data.
The Daily Toolkit
A handful of one-liners cover most routine categorical work. value_counts() is the workhorse, showing which categories exist and how common each is, with normalize=True switching to proportions and dropna=False including NaN as its own row. nunique() answers “how many distinct values?” more cheaply than taking the length of value_counts(). select_dtypes("object") filters to the text columns when you are hunting for cleanup candidates. And the .str methods (contains, strip, lower, title) run element-wise across a whole column with no loop needed.
    reviews["traveler_type"].value_counts(normalize=True)
    reviews["traveler_type"].nunique()
    reviews.select_dtypes("object")
    reviews["language"].str.contains("English", regex=False)
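How these one-liners treat missing values is worth seeing once. The sketch below uses a small invented reviews frame with one NaN.

```python
import numpy as np
import pandas as pd

# Invented data with one missing traveler type.
reviews = pd.DataFrame({
    "traveler_type": ["solo", "family", "solo", np.nan],
    "score": [4.5, 3.0, 5.0, 4.0],
})

# Proportions are computed over the non-missing rows by default.
shares = reviews["traveler_type"].value_counts(normalize=True)
print(shares.round(2).to_dict())  # {'solo': 0.67, 'family': 0.33}

# dropna=False adds NaN as its own row.
with_missing = reviews["traveler_type"].value_counts(dropna=False)
print(len(with_missing))  # 3

# nunique() ignores NaN by default.
n_types = reviews["traveler_type"].nunique()
print(n_types)  # 2
```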
Conclusion
Categorical data in pandas is one of those features that quietly improves everything once you adopt it. Convert repetitive text columns with astype("category"), or better, declare the type at load time, and use pd.Categorical with ordered=True when the values have a natural ranking. Do all your typo-fixing, whitespace-stripping, and case-standardizing on the strings first, then convert, and handle NaN and rare-value collapsing before modeling. Reach into .cat for anything that touches the category set itself, lean on ordered categories to make group-bys and plots come out in sensible order, and pick your encoding by the model: label for trees, boolean for flags, one-hot for linear models on unordered data. Get into these habits and your data is smaller, your plots are clearer, and your models get exactly the inputs they expect.