Most data science tutorials hand you a clean CSV and skip straight to the model. Real projects do not work like that. The data arrives messy, the business wants answers in pounds rather than AUC, and at least one column is lying to you.

This Code Along gives you the full experience in a single notebook: a churn prediction analysis for Brightcart Club, a fictional ecommerce membership programme. The retention team can contact 150 members a month. Your job is to tell them which 150.

What you will work through

The notebook follows the project lifecycle from start to finish, in plain sequential code with a comment on nearly every line. No frameworks, no functions to untangle, just the analysis:

Problem definition. Success metrics and five hypotheses, written down before touching the data.
Data collection. Loading the snapshot and reading it properly before trusting it.
Preprocessing. Duplicate rows, “UK” vs “uk” vs “U.K.”, members aged 999, negative session times, and missing values with three different mechanisms, each needing different treatment.
EDA. Testing all five hypotheses honestly, and finding the interaction the model will later confirm.
Feature engineering. Including the moment every data scientist meets eventually: a column that predicts churn almost perfectly, because it was written after the churn happened. You will catch it with one crosstab.
Model selection and training. Baselines first, then a fair fight between logistic regression, a decision tree and a random forest.
Documentation and handoff. A ranked outreach list, a model card and the reproducibility details.

The result

The twist worth the price of admission: after tuning both finalists, the simplest model wins. A logistic regression reaches a test ROC-AUC of 0.756 and beats the tuned random forest, while staying fully explainable.

Against the retention team’s current rule of thumb (“call whoever has not ordered in 60+ days”), the model lifts precision in the top 150 from 38.7% to 55.3%. Translated into money, that is roughly £19k of retained revenue per year from the same 150 phone calls.

Fictional dataset: members Download

How to get the most out of it

Do not just run all cells. Before each EDA section, pause and predict what the chart will show. Before the leakage reveal, see if you can spot the suspicious column yourself in the data dictionary. The notebook will still be there when you are wrong, and being wrong is where the learning is.

If you build something on top of the data, or your model beats 0.756 honestly, I want to hear about it.

Brightcart Club Churn — A Complete Data Science Code Along

Welcome to the Datalad Code Along. In this notebook we work through a real data science analysis from start to finish: predicting which members of Brightcart Club (a fictional ecommerce membership programme) will churn in the next 90 days.

The data is simulated but deliberately messy — duplicated rows, inconsistent labels, missing values with three different mechanisms, outliers and one booby-trapped column. Exactly the things real data does to you.

Content:

Problem Definition
Data Collection
Data Preprocessing
Exploratory Data Analysis
Feature Engineering
Model Selection
Model Training
Documentation & Handoff

1. Problem Definition

No code yet — and that is the point. The most expensive mistakes in data science happen before the first line of Python.

Business objective. Brightcart Club members pay £4.99–£14.99/month for free shipping and member pricing. Cancellations are creeping up, and every churned member costs the subscription fee plus roughly 60% of their shopping revenue. The retention team can contact about 150 members per month with a tailored offer.

Problem statement. Predict which Club members will churn in the 90 days after the snapshot date (2025-12-31), so the retention team can prioritise its limited outreach capacity.

Success metrics. The team works a ranked list, so what matters is precision in the top 150 — every slot wasted on a member who was staying anyway is a slot lost. We also track ROC-AUC for overall ranking quality. The baseline to beat: the current heuristic, “call whoever hasn’t ordered in 60+ days”.

Hypotheses (written down now, tested in EDA — not invented after looking at the data):

Payment failures are the strongest single churn signal.
Members who engage with email rarely churn.
Discount-dependent members churn more.
Tenure protects — the longer a member has stayed, the likelier they stay.
Premium members churn less than Basic members.

Scope and constraints. A monthly ranked list is enough (no real-time scoring). The retention team wants to understand why a member is flagged, so the model must be explainable.

Risks. Some CRM fields are written after churn happens — using them would mean predicting the past (leakage). The label itself is slightly noisy: a few “churned” members reactivate later.

Stakeholders. Retention lead (wants the list), finance (wants the £ impact), the data protection officer (age and country are personal data — minimise, document, and don’t ship them into the model without a reason).

2. Data Collection

One snapshot CSV from the warehouse: one row per active member on 2025-12-31, with behaviour aggregated over the trailing 90 days and the churn label observed over the following 90. The schema lives in DATA_DICTIONARY.md — reading the dictionary before the data saves you from discovering surprises the slow way.

Since 8,550 rows fit in memory thousands of times over, no sampling strategy is needed — we analyse the full population.

# import the core analysis stack
import numpy as np                  # numerical computing
import pandas as pd                 # dataframes
import matplotlib.pyplot as plt     # plotting
import seaborn as sns               # statistical plotting on top of matplotlib
import warnings, os                 # to silence noisy library warnings
warnings.filterwarnings('ignore')   # keep the notebook output clean
os.environ['PYTHONWARNINGS'] = 'ignore'   # ...including in parallel worker processes

sns.set_theme(style='whitegrid')    # readable default grid style
COBALT, ACID, INK = '#2438F5', '#C9DD2E', '#161510'   # Datalad colours
pd.set_option('display.max_columns', 50)              # show every column
RANDOM_STATE = 42                                     # one seed to rule them all

# load the snapshot
df = pd.read_csv('data/members.csv')   # read_csv = parse the CSV into a DataFrame
print(df.shape)                        # (rows, columns)
df.head()                              # head = first 5 rows, our first look

(8550, 27)

	member_id	signup_date	snapshot_date	age	country	plan_type	monthly_fee	payment_method	auto_renew	acquisition_channel	tenure_months	orders_last_90d	avg_order_value	total_spend_last_90d	days_since_last_order	sessions_per_month	avg_session_minutes	app_user	marketing_opt_in	email_open_rate	discount_share_orders	support_tickets_last_6m	avg_resolution_hours	payment_failures_last_6m	nps_score	winback_email_sent	churned_90d
0	M08191	2021-10-29	2025-12-31	NaN	Ireland	Premium	14.99	Apple Pay	True	Paid Search	50	3	28.72	82.82	20	7.7	15.6	True	True	0.30	0.50	1	12.7	0	NaN	True	False
1	M06966	2021-10-28	2025-12-31	29.0	France	Plus	9.99	Card	True	Organic Search	50	8	16.58	127.12	11	1.6	17.8	True	True	0.52	0.55	2	9.6	2	9.0	False	False
2	M07542	2024-03-28	2025-12-31	25.0	United Kingdom	Basic	4.99	Card	True	Social Media	21	4	37.65	159.68	38	5.7	21.7	True	True	0.26	0.52	0	NaN	0	7.0	False	False
3	M03830	2024-09-18	2025-12-31	35.0	United Kingdom	Basic	4.99	Card	True	Referral	15	1	10.55	11.02	66	16.1	17.9	True	True	0.24	0.15	0	NaN	0	NaN	False	False
4	M02034	2024-11-20	2025-12-31	20.0	Germany	premium	14.99	Apple Pay	False	Organic Search	13	2	39.72	78.81	21	0.9	12.9	True	True	0.64	0.15	0	NaN	0	7.0	False	False

# info = column names, non-null counts and dtypes in one screen
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8550 entries, 0 to 8549
Data columns (total 27 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   member_id                 8550 non-null   object 
 1   signup_date               8550 non-null   object 
 2   snapshot_date             8550 non-null   object 
 3   age                       8044 non-null   float64
 4   country                   8550 non-null   object 
 5   plan_type                 8550 non-null   object 
 6   monthly_fee               8550 non-null   float64
 7   payment_method            8550 non-null   object 
 8   auto_renew                8550 non-null   bool   
 9   acquisition_channel       8550 non-null   object 
 10  tenure_months             8550 non-null   int64  
 11  orders_last_90d           8550 non-null   int64  
 12  avg_order_value           8550 non-null   float64
 13  total_spend_last_90d      8550 non-null   float64
 14  days_since_last_order     8550 non-null   int64  
 15  sessions_per_month        8550 non-null   float64
 16  avg_session_minutes       8222 non-null   float64
 17  app_user                  8550 non-null   bool   
 18  marketing_opt_in          8550 non-null   bool   
 19  email_open_rate           5474 non-null   float64
 20  discount_share_orders     8550 non-null   float64
 21  support_tickets_last_6m   8550 non-null   int64  
 22  avg_resolution_hours      5008 non-null   float64
 23  payment_failures_last_6m  8550 non-null   int64  
 24  nps_score                 4340 non-null   float64
 25  winback_email_sent        8550 non-null   bool   
 26  churned_90d               8550 non-null   bool   
dtypes: bool(5), float64(10), int64(5), object(7)
memory usage: 1.5+ MB

# describe = count, mean, std, min, quartiles, max for every numeric column
# the .T transposes it so columns become rows - easier to scan
df.describe().T.round(2)

	count	mean	std	min	25%	50%	75%	max
age	8044.0	36.59	30.27	0.00	28.00	34.00	42.00	999.00
monthly_fee	8550.0	8.39	3.69	4.99	4.99	9.99	9.99	14.99
tenure_months	8550.0	16.26	14.41	1.00	5.00	12.00	23.00	59.00
orders_last_90d	8550.0	2.53	1.96	0.00	1.00	2.00	4.00	14.00
avg_order_value	8550.0	36.79	21.65	3.76	21.72	31.82	45.74	197.69
total_spend_last_90d	8550.0	92.39	99.52	0.00	28.30	64.61	125.76	1304.09
days_since_last_order	8550.0	37.90	24.11	1.00	19.00	32.00	53.00	90.00
sessions_per_month	8550.0	5.34	3.81	0.20	2.50	4.70	7.40	29.90
avg_session_minutes	8222.0	13.45	5.97	-23.80	9.30	13.30	17.60	36.20
email_open_rate	5474.0	0.34	0.19	0.00	0.19	0.32	0.47	0.95
discount_share_orders	8550.0	0.28	0.18	0.00	0.14	0.26	0.40	0.92
support_tickets_last_6m	8550.0	0.91	0.99	0.00	0.00	1.00	1.00	7.00
avg_resolution_hours	5008.0	23.42	18.57	1.60	11.30	18.60	29.70	214.30
payment_failures_last_6m	8550.0	0.22	0.47	0.00	0.00	0.00	0.00	4.00
nps_score	4340.0	7.45	1.73	1.00	6.00	8.00	9.00	10.00

Already three things jump out of describe(): age has a max of 999 (nobody is 999), avg_session_minutes has a negative minimum (nobody browses for minus twelve minutes), and avg_order_value has a max far above its 75th percentile (skew, whales, or both). We deal with all of it in preprocessing — that is what preprocessing is for.

3. Data Preprocessing

3.1 Duplicates and schema checks

Always check for duplicates before anything else — every later statistic is wrong if some members are counted twice.

# duplicated() flags repeated rows; sum() counts them
print('exact duplicate rows :', df.duplicated().sum())
print('duplicate member ids :', df['member_id'].duplicated().sum())

# drop them, keeping the first occurrence of each
df = df.drop_duplicates(subset='member_id', keep='first').reset_index(drop=True)
print('shape after dedup    :', df.shape)

exact duplicate rows : 50
duplicate member ids : 50
shape after dedup    : (8500, 27)

# quick schema sanity checks - cheap assertions that catch silent disasters
print('member_id unique :', df['member_id'].is_unique)
print('fee range        :', df['monthly_fee'].min(), '-', df['monthly_fee'].max())
print('orders >= 0      :', (df['orders_last_90d'] >= 0).all())
print('open rate in 0-1 :', df['email_open_rate'].dropna().between(0, 1).all())
print('target values    :', df['churned_90d'].unique())

member_id unique : True
fee range        : 4.99 - 14.99
orders >= 0      : True
open rate in 0-1 : True
target values    : [False  True]

3.2 Handling categorical mess

value_counts() on every categorical column is the fastest data-quality audit there is. Watch what it finds in country:

# dropna=False also shows missing values as their own row
print(df['country'].value_counts(dropna=False))

country
United Kingdom    2618
Germany           1134
France             994
Ireland            762
Netherlands        751
Spain              620
U.K.               327
uk                 322
UK                 319
germany            123
DE                 123
FR                  96
NL                  88
spain               82
ireland             81
ES                  60
Name: count, dtype: int64

# the same country hides under several spellings - map them to one canonical label
country_map = {'UK': 'United Kingdom', 'uk': 'United Kingdom', 'U.K.': 'United Kingdom',
               'DE': 'Germany', 'germany': 'Germany',
               'FR': 'France', 'ireland': 'Ireland', 'NL': 'Netherlands',
               'ES': 'Spain', 'spain ': 'Spain'}

df['country'] = df['country'].str.strip()        # strip = remove stray whitespace
df['country'] = df['country'].replace(country_map)  # replace = apply the mapping
print(df['country'].value_counts())

country
United Kingdom    3586
Germany           1380
France            1090
Ireland            843
Netherlands        839
Spain              680
spain               82
Name: count, dtype: int64

# plan_type has the same disease: 'basic ', 'PLUS', 'premium '
print('before:', sorted(df['plan_type'].unique()))
df['plan_type'] = df['plan_type'].str.strip().str.title()  # strip spaces, Title Case
print('after :', sorted(df['plan_type'].unique()))

before: ['Basic', 'Basic ', 'PLUS', 'Plus', 'Premium', 'basic', 'plus', 'premium ']
after : ['Basic', 'Plus', 'Premium']

3.3 Dealing with outliers

Rule one of outliers: data errors and genuine extremes are different animals. Errors get removed; genuine extremes get respected (and maybe capped). Age 999 is an error. A member who really spends £400 per order is your best customer — deleting them would be analytical self-harm.

# data-entry errors first: age sentinels and impossible negative durations
print('age = 999 :', (df['age'] == 999).sum(), 'rows')
print('age = 0   :', (df['age'] == 0).sum(), 'rows')
print('negative session minutes :', (df['avg_session_minutes'] < 0).sum(), 'rows')

# convert errors to NaN - they are not information, they are typos
df['age'] = df['age'].replace({999: np.nan, 0: np.nan})
df.loc[df['avg_session_minutes'] < 0, 'avg_session_minutes'] = np.nan

age = 999 : 7 rows
age = 0   : 5 rows
negative session minutes : 5 rows

# now the genuine extremes: average order value
fig, axes = plt.subplots(1, 2, figsize=(12, 3.5))          # two plots side by side
sns.boxplot(x=df['avg_order_value'], color=COBALT, ax=axes[0])  # boxplot shows the whales
axes[0].set_title('avg_order_value - raw')

# IQR fences: the classic definition of 'unusually far out'
q1, q3 = df['avg_order_value'].quantile([0.25, 0.75])      # quartiles
iqr = q3 - q1                                              # interquartile range
upper_fence = q3 + 1.5 * iqr
print('upper IQR fence :', round(upper_fence, 2))
print('rows above it   :', (df['avg_order_value'] > upper_fence).sum())

# these are real whales, not errors - so we cap (winsorise) at the 99th percentile
# instead of deleting paying customers
cap = df['avg_order_value'].quantile(0.99)
df['avg_order_value'] = df['avg_order_value'].clip(upper=cap)   # clip = cap values
df['total_spend_last_90d'] = df['total_spend_last_90d'].clip(
    upper=df['total_spend_last_90d'].quantile(0.99))

sns.boxplot(x=df['avg_order_value'], color=ACID, ax=axes[1])
axes[1].set_title('avg_order_value - capped at p99')
plt.tight_layout()
plt.show()

upper IQR fence : 81.81
rows above it   : 364

3.4 Handling missing values

Before imputing anything, ask why each value is missing. The textbook names three mechanisms, and all three live in this dataset:

MCAR (missing completely at random) — pure bad luck. Our age.
MAR (missing at random, explained by another column) — email_open_rate is missing exactly when marketing_opt_in is False. Not random at all: those members receive no emails.
MNAR (missing not at random, related to the missing value itself) — nps_score. Unhappy members skip surveys. The gap itself is a signal.

# the missingness audit: how many, and what share
missing = df.isnull().sum().sort_values(ascending=False)   # count NaN per column
missing = missing[missing > 0]                             # keep only the guilty
missing_pct = (missing / len(df) * 100).round(1)           # as a percentage
print(pd.DataFrame({'missing': missing, 'pct': missing_pct}))

missing_pct.plot(kind='barh', color=COBALT, figsize=(8, 3))
plt.title('Missing values by column (%)')
plt.show()

                      missing   pct
nps_score                4183  49.2
avg_resolution_hours     3517  41.4
email_open_rate          3057  36.0
age                       513   6.0
avg_session_minutes       332   3.9

# evidence that nps_score is MNAR: churn among survey-skippers vs responders
churn_skipped  = df.loc[df['nps_score'].isnull(), 'churned_90d'].mean()
churn_answered = df.loc[df['nps_score'].notnull(), 'churned_90d'].mean()
print(f'churn rate when NPS missing  : {churn_skipped:.1%}')
print(f'churn rate when NPS answered : {churn_answered:.1%}')

# and that email_open_rate is structural (MAR): missing only when not opted in
print(pd.crosstab(df['marketing_opt_in'], df['email_open_rate'].isnull()))

churn rate when NPS missing  : 23.5%
churn rate when NPS answered : 19.7%
email_open_rate   False  True 
marketing_opt_in              
False                 0   3057
True               5443      0

# impute - each column according to its mechanism
df['nps_missing'] = df['nps_score'].isnull().astype(int)   # keep the signal as a flag!
df['nps_score'] = df['nps_score'].fillna(df['nps_score'].median())

# MCAR columns: median is fine (median resists the skew we have not fixed yet)
df['age'] = df['age'].fillna(df['age'].median())
df['avg_session_minutes'] = df['avg_session_minutes'].fillna(df['avg_session_minutes'].median())

# structural ones: 'missing' actually means zero exposure
df['email_open_rate'] = df['email_open_rate'].fillna(0)    # no emails -> no opens
df['avg_resolution_hours'] = df['avg_resolution_hours'].fillna(0)  # no tickets -> nothing to resolve

print('remaining missing values :', df.isnull().sum().sum())

remaining missing values : 0

Note what we did not do: blindly mean-impute email_open_rate. That would have invented a fictional 35% open rate for people who receive no email — and quietly poisoned every model downstream. The flag column nps_missing keeps the MNAR signal alive as a feature.

3.5 Handling skewed data

Money columns are almost always right-skewed: many small values, a long tail of big ones. A log transform pulls the tail in, which helps linear models and makes plots readable.

# skew() > 1 is conventionally 'highly skewed'
skews = df[['avg_order_value', 'total_spend_last_90d',
            'support_tickets_last_6m', 'sessions_per_month']].skew().round(2)
print(skews)

# log1p = log(1 + x), safe for zeros
df['log_spend'] = np.log1p(df['total_spend_last_90d'])
df['log_aov'] = np.log1p(df['avg_order_value'])

fig, axes = plt.subplots(1, 2, figsize=(12, 3.5))
df['total_spend_last_90d'].hist(bins=50, color=COBALT, ax=axes[0])  # before
axes[0].set_title('total_spend_last_90d - skewed')
df['log_spend'].hist(bins=50, color=ACID, ax=axes[1])               # after
axes[1].set_title('log1p(total_spend) - roughly symmetric')
plt.tight_layout()
plt.show()

avg_order_value            1.38
total_spend_last_90d       1.80
support_tickets_last_6m    1.20
sessions_per_month         1.10
dtype: float64

3.6 Data types and normalisation

Last tidy-up: booleans become 0/1 integers, dates become real datetimes. One deliberate decision: we do not scale the data here. Scaling must be fitted on the training split only (otherwise the test set leaks into the transform), so it belongs inside the modelling pipeline — you will see it there in section 7.

# booleans -> int so models and correlations can digest them
for col in ['auto_renew', 'app_user', 'marketing_opt_in', 'winback_email_sent', 'churned_90d']:
    df[col] = df[col].astype(int)

# strings -> proper datetimes
df['signup_date'] = pd.to_datetime(df['signup_date'])
df['snapshot_date'] = pd.to_datetime(df['snapshot_date'])

print(df.dtypes.value_counts())   # final dtype inventory
print('clean dataset :', df.shape)

float64           12
int64             11
object             5
datetime64[ns]     2
Name: count, dtype: int64
clean dataset : (8500, 30)

4. Exploratory Data Analysis

Preprocessing made the data correct; EDA makes it understood. We move from “what is in the columns” to “what is driving churn” — and we test the five hypotheses we wrote down in section 1, before any model gets a vote.

4.1 Target variable analysis

# how imbalanced is the target?
churn_rate = df['churned_90d'].mean()                     # mean of 0/1 = share of 1s
print(f'churn rate : {churn_rate:.1%}  ({df.churned_90d.sum():,} of {len(df):,} members)')

ax = sns.countplot(x='churned_90d', data=df, hue='churned_90d',
                   palette=[COBALT, ACID], legend=False)  # bar per class
ax.set_xticklabels(['stayed', 'churned'])
ax.set_title('Target balance - roughly 4:1')
plt.show()

churn rate : 21.6%  (1,832 of 8,500 members)

22% positive class: imbalanced enough that accuracy is a trap (predicting “nobody churns” scores 78%), but not so extreme that we need resampling tricks. Stratified splits plus rank-based metrics will carry us.

4.2 Numerical variables

# histograms of every behavioural numeric in one grid
num_cols = ['age', 'tenure_months', 'orders_last_90d', 'avg_order_value',
            'days_since_last_order', 'sessions_per_month', 'avg_session_minutes',
            'email_open_rate', 'discount_share_orders', 'support_tickets_last_6m',
            'payment_failures_last_6m', 'nps_score']
df[num_cols].hist(bins=30, figsize=(14, 9), color=COBALT)  # one histogram per column
plt.suptitle('Distributions of the numeric features', y=1.01)
plt.tight_layout()
plt.show()

4.3 Categorical variables

# counts and churn rate side by side for the three big categoricals
fig, axes = plt.subplots(1, 3, figsize=(15, 3.8))
for ax, col in zip(axes, ['plan_type', 'acquisition_channel', 'payment_method']):
    rate = df.groupby(col)['churned_90d'].mean().sort_values()  # churn rate per level
    rate.plot(kind='barh', color=COBALT, ax=ax)                 # horizontal bars
    ax.axvline(churn_rate, color=INK, linestyle='--', linewidth=1)  # overall rate
    ax.set_title(f'churn rate by {col}')
    ax.set_xlabel('')
plt.tight_layout()
plt.show()

4.4 Relationships between variables

# correlation heatmap: numeric features vs each other and vs the target
corr_cols = num_cols + ['auto_renew', 'app_user', 'nps_missing', 'churned_90d']
corr = df[corr_cols].corr()                       # pairwise Pearson correlations

plt.figure(figsize=(11, 8))
sns.heatmap(corr, annot=True, fmt='.2f', cmap='RdBu_r', center=0,
            annot_kws={'size': 7}, cbar_kws={'shrink': 0.7})
plt.title('Correlation matrix')
plt.show()

# the single most useful line of EDA: correlation with the target, sorted
print(corr['churned_90d'].drop('churned_90d').sort_values())

orders_last_90d            -0.174521
auto_renew                 -0.151725
tenure_months              -0.124895
sessions_per_month         -0.120691
nps_score                  -0.120606
email_open_rate            -0.088462
avg_session_minutes        -0.059934
app_user                   -0.059823
age                        -0.015419
avg_order_value            -0.000101
discount_share_orders       0.036743
nps_missing                 0.046608
support_tickets_last_6m     0.119789
payment_failures_last_6m    0.174183
days_since_last_order       0.182337
Name: churned_90d, dtype: float64

Two things worth noticing beyond the churn correlations: total_spend_last_90d would correlate almost perfectly with orders × avg_order_value (it is built from them), and monthly_fee is just plan_type wearing a number — redundancy we will address in feature engineering.

4.5 Testing our hypotheses

Time to grade the five guesses from section 1.

# H1: payment failures are the strongest single signal
rate_by_fails = df.groupby('payment_failures_last_6m')['churned_90d'].agg(['mean', 'size'])
print(rate_by_fails)

rate_by_fails['mean'].plot(kind='bar', color=COBALT, figsize=(7, 3))
plt.axhline(churn_rate, color=INK, linestyle='--', linewidth=1, label='overall')
plt.title('H1 - churn rate by payment failures'); plt.legend(); plt.show()

                              mean  size
payment_failures_last_6m                
0                         0.182434  6786
1                         0.323491  1524
2                         0.533333   180
3                         0.375000     8
4                         1.000000     2

# H2: email engagement protects (only meaningful for members who GET emails)
opted = df[df['marketing_opt_in'] == 1].copy()             # opted-in members only
opted['open_bin'] = pd.cut(opted['email_open_rate'],       # cut = bucket into bins
                           bins=[0, .1, .25, .4, .6, 1],
                           labels=['0-10%', '10-25%', '25-40%', '40-60%', '60%+'])
print(opted.groupby('open_bin')['churned_90d'].mean().round(3))

open_bin
0-10%     0.325
10-25%    0.246
25-40%    0.190
40-60%    0.168
60%+      0.130
Name: churned_90d, dtype: float64

# H3: discount dependence - qcut = quartile buckets with equal member counts
df['discount_bin'] = pd.qcut(df['discount_share_orders'], q=4,
                             labels=['Q1 low', 'Q2', 'Q3', 'Q4 high'])
print(df.groupby('discount_bin')['churned_90d'].mean().round(3))

discount_bin
Q1 low     0.197
Q2         0.205
Q3         0.225
Q4 high    0.238
Name: churned_90d, dtype: float64

# H4: tenure protects
df['tenure_bin'] = pd.cut(df['tenure_months'], bins=[0, 6, 12, 24, 36, 60],
                          labels=['0-6m', '6-12m', '1-2y', '2-3y', '3y+'])
df.groupby('tenure_bin')['churned_90d'].mean().plot(
    kind='bar', color=COBALT, figsize=(7, 3))
plt.axhline(churn_rate, color=INK, linestyle='--', linewidth=1)
plt.title('H4 - churn rate by tenure'); plt.show()

# H5: plan tier - and its interaction with auto-renew
pivot = df.pivot_table(values='churned_90d', index='plan_type',
                       columns='auto_renew', aggfunc='mean')  # 2-way churn table
pivot.columns = ['auto_renew off', 'auto_renew on']
print(pivot.round(3))

sns.heatmap(pivot, annot=True, fmt='.1%', cmap='Blues')
plt.title('H5 - churn by plan and auto-renew')
plt.show()

           auto_renew off  auto_renew on
plan_type                               
Basic               0.320          0.190
Plus                0.277          0.177
Premium             0.274          0.123

Verdict on the hypotheses: all five hold, and H1 is dramatic — members with 2+ payment failures churn at several times the base rate. The pivot also reveals an interaction: auto-renew protects every plan, but Basic without auto-renew is the danger zone. Interactions like that are exactly what feature engineering should encode.

4.6 Subgroups

# churn across age bands - also a preview of who the model will affect most
df['age_band'] = pd.cut(df['age'], bins=[17, 25, 35, 45, 55, 81],
                        labels=['18-25', '26-35', '36-45', '46-55', '56+'])
band = df.groupby('age_band')['churned_90d'].agg(['mean', 'size'])
print(band.round(3))

band['mean'].plot(kind='bar', color=COBALT, figsize=(7, 3))
plt.axhline(churn_rate, color=INK, linestyle='--', linewidth=1)
plt.title('Churn rate by age band'); plt.show()

           mean  size
age_band             
18-25     0.239  1214
26-35     0.209  3676
36-45     0.222  2304
46-55     0.201   924
56+       0.199   382

5. Feature Engineering

5.1 The leakage trap

Before engineering anything in, check whether something needs throwing out. winback_email_sent is a CRM flag — and the win-back campaign is sent to members who already lapsed. Look how suspiciously good it is:

# crosstab with normalize='index' = row percentages
print(pd.crosstab(df['winback_email_sent'], df['churned_90d'], normalize='index').round(3))

# 'a feature that predicts the target this well is usually the target in disguise'
df = df.drop(columns=['winback_email_sent'])   # out it goes
print('winback_email_sent dropped')

churned_90d             0      1
winback_email_sent              
0                   0.924  0.076
1                   0.212  0.788
winback_email_sent dropped

72% of flagged members churned vs 8% of unflagged — because the flag is written after the churn we are trying to predict. In production this column would not exist at scoring time. Any time a feature looks too good to be true, check when it gets its value.

5.2 New features

Domain knowledge in, columns out: recency-frequency style ratios, the interaction EDA showed us, and one polynomial term.

# domain-driven constructions
df['orders_per_month'] = df['orders_last_90d'] / 3                    # frequency
df['value_per_session'] = df['total_spend_last_90d'] / (df['sessions_per_month'] * 3 + 1)
df['recency_share'] = (df['days_since_last_order'] /
                       (df['tenure_months'] * 30.4)).clip(upper=1)    # recency relative to lifetime

# the interaction EDA found: cheap plan AND no auto-renew
df['basic_no_renew'] = ((df['plan_type'] == 'Basic') & (df['auto_renew'] == 0)).astype(int)

# discount dependence x order frequency
df['discount_x_orders'] = df['discount_share_orders'] * df['orders_last_90d']

# one polynomial example: tenure's protective effect flattens out, square captures the curve
df['tenure_sq'] = df['tenure_months'] ** 2

print('engineered features added:', ['orders_per_month', 'value_per_session',
      'recency_share', 'basic_no_renew', 'discount_x_orders', 'tenure_sq'])

engineered features added: ['orders_per_month', 'value_per_session', 'recency_share', 'basic_no_renew', 'discount_x_orders', 'tenure_sq']

5.3 Encoding

Models eat numbers, not strings. Our categoricals are low-cardinality, so one-hot encoding via get_dummies is the right tool (target encoding earns its keep only at high cardinality).

# columns the model must never see: ids, dates, helper bins from EDA
drop_cols = ['member_id', 'signup_date', 'snapshot_date',
             'discount_bin', 'tenure_bin', 'age_band',
             'monthly_fee',               # redundant: it IS plan_type as a number
             'total_spend_last_90d', 'avg_order_value']  # replaced by their log versions

X = df.drop(columns=drop_cols + ['churned_90d'])   # features
y = df['churned_90d']                              # target

# get_dummies = one-hot encode; drop_first avoids the redundant reference column
X = pd.get_dummies(X, columns=['plan_type', 'acquisition_channel',
                               'payment_method', 'country'], drop_first=True)
X = X.astype(float)                                # uniform dtype for sklearn
print('final feature matrix :', X.shape)
print(list(X.columns))

final feature matrix : (8500, 40)
['age', 'auto_renew', 'tenure_months', 'orders_last_90d', 'days_since_last_order', 'sessions_per_month', 'avg_session_minutes', 'app_user', 'marketing_opt_in', 'email_open_rate', 'discount_share_orders', 'support_tickets_last_6m', 'avg_resolution_hours', 'payment_failures_last_6m', 'nps_score', 'nps_missing', 'log_spend', 'log_aov', 'orders_per_month', 'value_per_session', 'recency_share', 'basic_no_renew', 'discount_x_orders', 'tenure_sq', 'plan_type_Plus', 'plan_type_Premium', 'acquisition_channel_Marketplace', 'acquisition_channel_Organic Search', 'acquisition_channel_Paid Search', 'acquisition_channel_Referral', 'acquisition_channel_Social Media', 'payment_method_Card', 'payment_method_Klarna', 'payment_method_PayPal', 'country_Germany', 'country_Ireland', 'country_Netherlands', 'country_Spain', 'country_United Kingdom', 'country_spain']

5.4 Feature selection

With 30-odd features we are not drowning, but ranking them by mutual information (which catches non-linear relationships that correlation misses) tells us where the signal lives — and flags any dead weight.

from sklearn.feature_selection import mutual_info_classif   # MI scorer

mi = mutual_info_classif(X, y, random_state=RANDOM_STATE)    # MI of each feature vs target
mi_scores = pd.Series(mi, index=X.columns).sort_values(ascending=False)

mi_scores.head(15).sort_values().plot(kind='barh', color=COBALT, figsize=(8, 5))
plt.title('Top 15 features by mutual information')
plt.show()

# keep everything with measurable signal; drop the flatliners
selected = mi_scores[mi_scores > 0.001].index.tolist()
X = X[selected]
print(f'kept {len(selected)} of {len(mi_scores)} features')

kept 28 of 40 features

6. Model Selection

Three decisions, made before training so the results cannot seduce us:

Evaluation metric. ROC-AUC for ranking quality during model comparison, precision@150 as the business yardstick (the retention team’s monthly capacity). Accuracy is banned — see section 4.1.

Candidate models. A complexity ladder: logistic regression (fast, explainable baseline), a single decision tree (captures interactions, explainable), random forest (usually the accuracy sweet spot on tabular data, still gives feature importances — satisfying our explainability constraint).

Split strategy. One stratified 75/25 train/test split — stratified so both sides keep the 22% churn rate; the test set stays untouched until the final evaluation. Model comparison happens via 5-fold cross-validation inside the training set.

from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold

# stratify=y keeps the churn rate identical in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=RANDOM_STATE)

print('train :', X_train.shape, f'churn {y_train.mean():.1%}')
print('test  :', X_test.shape, f'churn {y_test.mean():.1%}')

# the CV splitter every comparison below will share
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)

train : (6375, 28) churn 21.6%
test  : (2125, 28) churn 21.6%

7. Model Training

7.1 Baselines

A model is only impressive relative to what the business does today. Two baselines: the naive majority-class dummy, and the retention team’s actual heuristic — “contact whoever hasn’t ordered in 60+ days”.

from sklearn.dummy import DummyClassifier
from sklearn.metrics import roc_auc_score

# naive baseline: always predict the majority class probabilities
dummy = DummyClassifier(strategy='prior').fit(X_train, y_train)
print('dummy AUC          :', round(roc_auc_score(y_test, dummy.predict_proba(X_test)[:, 1]), 3))

# the business heuristic, evaluated as precision@150 on the test set:
# rank test members by days since last order, take the top 150, count real churners
test_recency = df.loc[X_test.index, 'days_since_last_order']   # align by index
heuristic_picks = test_recency.sort_values(ascending=False).head(150).index
heuristic_p150 = y_test.loc[heuristic_picks].mean()
print(f'heuristic precision@150 : {heuristic_p150:.1%}')
print(f'random targeting would score the base rate: {y_test.mean():.1%}')

dummy AUC          : 0.5
heuristic precision@150 : 38.7%
random targeting would score the base rate: 21.6%

7.2 Comparing candidates

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# logistic regression needs scaled inputs -> scaler lives INSIDE the pipeline,
# so each CV fold fits the scaler on its own training portion (no leakage)
candidates = {
    'logistic regression': Pipeline([('scale', StandardScaler()),
                                     ('model', LogisticRegression(max_iter=1000))]),
    'decision tree': DecisionTreeClassifier(max_depth=6, random_state=RANDOM_STATE),
    'random forest': RandomForestClassifier(n_estimators=200, random_state=RANDOM_STATE),
}

results = {}
for name, model in candidates.items():
    # 5-fold cross-validated AUC on the training set only
    scores = cross_val_score(model, X_train, y_train, cv=cv, scoring='roc_auc')
    results[name] = scores
    print(f'{name:20s}: AUC {scores.mean():.3f} +/- {scores.std():.3f}')

logistic regression : AUC 0.754 +/- 0.013
decision tree       : AUC 0.692 +/- 0.013

random forest       : AUC 0.724 +/- 0.011

Read that table carefully, because it contains the most useful lesson in this notebook: the simplest model won. Complexity is not a ladder you must climb — on tabular data whose effects are mostly monotonic (more payment failures, more churn; longer tenure, less churn), a well-fed logistic regression is very hard to beat. It is also the most explainable candidate we had, which our stakeholders explicitly asked for.

7.3 Hyperparameter tuning

We tune both finalists with the same grid-search machinery — the same AUC metric, the same stratified 5-fold CV, or the comparison means nothing. Logistic regression gets its regularisation strength C tuned; the forest gets one fair shot at catching up via depth and leaf size (the two hyperparameters that control its overfitting).

from sklearn.model_selection import GridSearchCV

# finalist 1: logistic regression - tune the regularisation strength C
# (small C = strong regularisation = simpler model)
search_lr = GridSearchCV(
    Pipeline([('scale', StandardScaler()),
              ('model', LogisticRegression(max_iter=1000))]),
    {'model__C': [0.03, 0.1, 0.3, 1, 3]},   # model__C = parameter C inside the pipeline step 'model'
    cv=cv, scoring='roc_auc', n_jobs=-1)     # n_jobs=-1 = use all CPU cores
search_lr.fit(X_train, y_train)
print('logistic best params :', search_lr.best_params_)
print('logistic best CV AUC :', round(search_lr.best_score_, 3))

# finalist 2: random forest - can tuning close the gap?
search_rf = GridSearchCV(
    RandomForestClassifier(n_estimators=200, random_state=RANDOM_STATE),
    {'max_depth': [8, 12, None],         # how deep each tree may grow
     'min_samples_leaf': [5, 20, 50]},   # minimum members per leaf - bigger = smoother
    cv=cv, scoring='roc_auc', n_jobs=-1)
search_rf.fit(X_train, y_train)
print('forest best params   :', search_rf.best_params_)
print('forest best CV AUC   :', round(search_rf.best_score_, 3))

# the winner takes the test set - chosen on CV score, never on test performance
best_model = search_lr.best_estimator_ if search_lr.best_score_ >= search_rf.best_score_ \
             else search_rf.best_estimator_
print('selected:', 'logistic regression' if best_model is search_lr.best_estimator_ else 'random forest')

logistic best params : {'model__C': 3}
logistic best CV AUC : 0.754

forest best params   : {'max_depth': 8, 'min_samples_leaf': 5}
forest best CV AUC   : 0.74
selected: logistic regression

7.4 Final evaluation on the held-out test set

The test set has been quarantined since section 6. It gets used exactly once — now.

from sklearn.metrics import roc_curve, confusion_matrix, classification_report

test_proba = best_model.predict_proba(X_test)[:, 1]      # churn probability per member
test_auc = roc_auc_score(y_test, test_proba)
print(f'test ROC-AUC : {test_auc:.3f}')

# ROC curve: tradeoff between catching churners and false alarms
fpr, tpr, _ = roc_curve(y_test, test_proba)
plt.figure(figsize=(5.5, 4.5))
plt.plot(fpr, tpr, color=COBALT, linewidth=2, label=f'tuned logistic regression (AUC {test_auc:.3f})')
plt.plot([0, 1], [0, 1], color=INK, linestyle='--', linewidth=1, label='coin flip')
plt.xlabel('false positive rate'); plt.ylabel('true positive rate')
plt.title('ROC curve - test set'); plt.legend(); plt.show()

test ROC-AUC : 0.756

# confusion matrix at the default 0.5 threshold - useful context,
# though our real deployment decision is 'top 150', not a threshold
print(confusion_matrix(y_test, test_proba > 0.5))
print(classification_report(y_test, test_proba > 0.5, target_names=['stayed', 'churned']))

[[1591   76]
 [ 367   91]]
              precision    recall  f1-score   support

      stayed       0.81      0.95      0.88      1667
     churned       0.54      0.20      0.29       458

    accuracy                           0.79      2125
   macro avg       0.68      0.58      0.58      2125
weighted avg       0.75      0.79      0.75      2125

# the metric the retention team actually feels: precision in their 150 slots
top150 = pd.Series(test_proba, index=X_test.index).sort_values(ascending=False).head(150).index
model_p150 = y_test.loc[top150].mean()

print(f'model precision@150     : {model_p150:.1%}')
print(f'heuristic precision@150 : {heuristic_p150:.1%}')
print(f'random targeting        : {y_test.mean():.1%}')
print(f'lift over heuristic     : {model_p150 / heuristic_p150:.2f}x')

model precision@150     : 55.3%
heuristic precision@150 : 38.7%
random targeting        : 21.6%
lift over heuristic     : 1.43x

# which features drive the prediction - our explainability promise to the retention team.
# logistic coefficients on STANDARDISED features are directly comparable:
# positive = pushes churn risk up, negative = protects
coefs = pd.Series(best_model.named_steps['model'].coef_[0], index=X_train.columns)
top_coefs = coefs.reindex(coefs.abs().sort_values().tail(15).index)  # 15 largest by magnitude

colors = [COBALT if v > 0 else '#A39F8F' for v in top_coefs]   # risk factors vs protectors
top_coefs.plot(kind='barh', color=colors, figsize=(8, 5))
plt.axvline(0, color=INK, linewidth=1)
plt.title('Logistic regression coefficients (standardised) - blue raises risk, grey protects')
plt.show()

drivers = coefs.abs().sort_values(ascending=False).head(5).index.tolist()

# translate the lift into money - state every assumption out loud
CONTACT_SUCCESS = 0.30     # assumed: 30% of contacted true churners are saved
VALUE_PER_SAVE = 210       # assumed: GBP value of a saved member (fees + retained shopping margin)

model_saves = 150 * model_p150 * CONTACT_SUCCESS         # expected saves per month
heuristic_saves = 150 * heuristic_p150 * CONTACT_SUCCESS
extra_per_year = (model_saves - heuristic_saves) * VALUE_PER_SAVE * 12

print(f'expected saves/month - model     : {model_saves:.0f}')
print(f'expected saves/month - heuristic : {heuristic_saves:.0f}')
print(f'incremental value per year       : ~GBP {extra_per_year:,.0f}')

expected saves/month - model     : 25
expected saves/month - heuristic : 17
incremental value per year       : ~GBP 18,900

8. Documentation & Handoff

The analysis is only finished when someone else can use it and reproduce it. Three artefacts: the ranked list (for the retention team), a model card (for whoever inherits this), and the reproducibility details.

# the deliverable the retention team actually asked for: this month's list
handoff = df.loc[top150, ['member_id', 'plan_type', 'tenure_months',
                          'days_since_last_order', 'payment_failures_last_6m',
                          'auto_renew']].copy()
handoff['churn_probability'] = pd.Series(test_proba, index=X_test.index).loc[top150].round(3)
handoff = handoff.sort_values('churn_probability', ascending=False)
handoff.head(10)   # top 10 of the 150 - ids, risk score, and the 'why' columns

	member_id	plan_type	tenure_months	days_since_last_order	payment_failures_last_6m	churn_probability
8128	M06320	Premium	24	40	3	0.889
246	M05077	Basic	9	90	1	0.882
6300	M04923	Basic	2	32	2	0.873
3938	M07753	Plus	12	76	2	0.872
7944	M05265	Plus	15	62	2	0.854
263	M01818	Basic	4	90	0	0.827
2851	M02680	Basic	9	35	2	0.824
3589	M00818	Plus	4	32	3	0.820
2921	M06694	Basic	3	90	1	0.807
7654	M00750	Basic	1	31	2	0.793

# the model card - a one-screen honest summary that travels with the model
print(f'''
MODEL CARD - Brightcart Club churn ranker
=========================================
Intended use   : rank active members by 90-day churn risk; top 150/month
                 contacted by retention. Not for pricing or credit decisions.
Training data  : 8,500 members, snapshot 2025-12-31, churn window +90 days,
                 churn rate {churn_rate:.1%}. Simulated dataset (Datalad Code Along).
Model          : LogisticRegression (C={search_lr.best_params_["model__C"]}, standardised
                 features), chosen over a tuned random forest on CV AUC. Seed {RANDOM_STATE}.
Performance    : test ROC-AUC {test_auc:.3f}; precision@150 {model_p150:.1%}
                 vs heuristic {heuristic_p150:.1%} ({model_p150/heuristic_p150:.2f}x lift)
Top drivers    : {", ".join(drivers)}
Known limits   : label noise (lapsed members may reactivate); trained on one
                 snapshot - assumes a stable population; leakage column
                 winback_email_sent identified and excluded.
''')

MODEL CARD - Brightcart Club churn ranker
=========================================
Intended use   : rank active members by 90-day churn risk; top 150/month
                 contacted by retention. Not for pricing or credit decisions.
Training data  : 8,500 members, snapshot 2025-12-31, churn window +90 days,
                 churn rate 21.6%. Simulated dataset (Datalad Code Along).
Model          : LogisticRegression (C=3, standardised
                 features), chosen over a tuned random forest on CV AUC. Seed 42.
Performance    : test ROC-AUC 0.756; precision@150 55.3%
                 vs heuristic 38.7% (1.43x lift)
Top drivers    : tenure_months, payment_failures_last_6m, days_since_last_order, auto_renew, email_open_rate
Known limits   : label noise (lapsed members may reactivate); trained on one
                 snapshot - assumes a stable population; leakage column
                 winback_email_sent identified and excluded.

# reproducibility: versions and seed, so the numbers above can be regenerated
import sklearn, sys
print('python  :', sys.version.split()[0])
print('pandas  :', pd.__version__)
print('numpy   :', np.__version__)
print('sklearn :', sklearn.__version__)
print('seaborn :', sns.__version__)
print('seed    :', RANDOM_STATE)
print('data    : data/members.csv (regenerate with generate_churn_data.py, seed 7)')

python  : 3.9.6
pandas  : 2.3.3
numpy   : 2.0.2
sklearn : 1.6.1
seaborn : 0.13.2
seed    : 42
data    : data/members.csv (regenerate with generate_churn_data.py, seed 7)

Wrapping up

We went from a messy CSV to a defensible, explainable churn ranker:

Preprocessing fixed duplicates, label mess, sentinel outliers and three flavours of missingness — each treated according to its mechanism.
EDA confirmed all five pre-registered hypotheses and found the Basic-without-auto-renew interaction.
Feature engineering caught a leakage column that would have faked a 0.90 AUC, and built the features that earned their keep in the rankings.
The model comparison delivered the quiet classic: a tuned logistic regression beat the random forest — simpler, faster, and exactly as explainable as the stakeholders demanded.
The final model beats the team’s recency heuristic by a wide margin in the metric that matters — precision in their 150 monthly slots — and the £ translation makes the case in finance’s language.

GTM Single Page Application Tracking: Getting Pageviews Right When Pages Never Reload

GTM Zones: Managing Tags Across Teams, Partners and Multiple Sites

GTM Tag Sequencing: Controlling the Order Your Tags Fire

XGBoost: A Practical Guide to Extreme Gradient Boosting

Code Along: A Complete Churn Analysis in One Notebook