Inside the Brightcart Club Dataset: How I Built a Deliberately Messy Churn File

Most teaching datasets are too clean. Here is how I built the Brightcart Club churn file with planted duplicates, three kinds of missing data, and a deliberate leakage trap.

Most teaching datasets have one fatal flaw: they are too clean. Titanic, Iris, and the Boston housing set all arrive largely pre-scrubbed, which means a learner can jump straight to modelling and never practise the part of data science that actually consumes most of a working day. The dataset behind the Brightcart Club churn code-along was built to do the opposite. Every imperfection in it is planted on purpose, each one chosen to force a specific lesson.

A Fictional Membership Programme, Simulated From Scratch

Brightcart Club is invented, and so is every one of its 8,550 members. The file is generated by a Python script with a fixed random seed, so the data is fully reproducible and contains no real personal information. Each member gets a realistic profile: a signup date, a country, a plan tier (Basic, Plus, or Premium), a payment method, and a stack of behavioural features covering orders, spend, session activity, support tickets, and payment history over the 90 days before a snapshot date of 31 December 2025.

The target is whether the member churns in the 90 days after that snapshot, and it sits at roughly 22% positive. That number was chosen deliberately. It is imbalanced enough that plain accuracy becomes a misleading metric, which is the whole point, but not so extreme that the problem tips into rare-event territory and needs special handling. It is the kind of imbalance a real subscription business could plausibly see.
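To make the accuracy problem concrete, consider a "model" that predicts no churn for everyone. The sketch below uses a made-up label vector with 22 churners in 100 members; the numbers are illustrative and not drawn from the actual file.

```python
# Illustrative labels only: 22 churners out of 100 members (~22% positive).
y_true = [1] * 22 + [0] * 78

# A "model" that always predicts the majority class (no churn).
y_pred = [0] * len(y_true)

# Accuracy: share of predictions that match the label.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall: share of actual churners the model catches.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(y_true)

print(f"accuracy = {accuracy:.2f}")  # accuracy = 0.78
print(f"recall   = {recall:.2f}")    # recall   = 0.00
```

A 78% accuracy score looks respectable, yet this baseline never identifies a single churner, which is why recall, precision, or a ranking metric is a better yardstick at this base rate.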

Crucially, churn is not random. Behind the scenes, a latent risk score drives each member’s churn probability through a logistic function, and that score is built from genuine signals: payment failures push risk up hard, auto-renewal and longer tenure pull it down, low email engagement and heavy discount reliance push it up. This matters because it means the relationships in the data are real and recoverable. A good analysis will rediscover them, which is exactly what makes the code-along satisfying to work through.
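The mechanism can be sketched in a few lines. This is not the actual generator script: the function names, feature names, and every coefficient below are hypothetical, chosen only to match the directions described above (payment failures up, auto-renewal and tenure down, low engagement and discount reliance up).

```python
import math
import random


def churn_probability(payment_failures, auto_renew, tenure_months,
                      email_open_rate, discount_share):
    """Map member signals to a churn probability via a logistic function.

    All coefficients are assumed for illustration, not the real values.
    """
    risk = (
        -1.0                        # intercept (assumed)
        + 0.9 * payment_failures    # failures push risk up hard
        - 0.8 * auto_renew          # auto-renewal pulls risk down
        - 0.03 * tenure_months      # longer tenure pulls risk down
        - 1.2 * email_open_rate     # low engagement means less reduction
        + 1.0 * discount_share      # heavy discount reliance pushes risk up
    )
    return 1.0 / (1.0 + math.exp(-risk))


rng = random.Random(42)  # fixed seed so the draws are reproducible


def simulate_churn(**signals):
    """Draw a 0/1 churn label from the member's churn probability."""
    return int(rng.random() < churn_probability(**signals))


risky = dict(payment_failures=3, auto_renew=0, tenure_months=2,
             email_open_rate=0.05, discount_share=0.8)
loyal = dict(payment_failures=0, auto_renew=1, tenure_months=36,
             email_open_rate=0.60, discount_share=0.1)

print(round(churn_probability(**risky), 3))
print(round(churn_probability(**loyal), 3))
```

Because the label is a random draw from a probability rather than a hard rule, a model fitted to the data can recover the direction and rough size of each effect without ever predicting perfectly.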

Feel free to download this dataset and practise on it.

Every Mess Teaches Something

The interesting part is the planted imperfections, because each maps to a step in the data-cleaning workflow. Around 50 exact duplicate rows are scattered in, so the learner has to find and remove them before doing anything else. The age column carries about 6% missing values plus sentinel errors of 999 and 0, the classic data-entry junk that masquerades as a real number. A handful of session-length values are negative, which is impossible and has to be caught. Country labels are deliberately inconsistent, with “United Kingdom”, “UK”, “uk”, and “U.K.” all appearing for the same place, and plan names arrive with stray casing and whitespace like “PLUS” and “basic ”.
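The cleaning steps described above might look like this in pandas. It is a minimal sketch on a tiny stand-in frame: the column names (member_id, age, avg_session_minutes, country, plan) are assumptions, not the real schema, and treating bad values as missing is one reasonable choice among several.

```python
import numpy as np
import pandas as pd

# Tiny stand-in frame; column names are assumed, not taken from the real file.
df = pd.DataFrame({
    "member_id": [1, 2, 2, 3, 4, 5],
    "age": [34, 999, 999, 0, 41, np.nan],
    "avg_session_minutes": [12.5, 8.0, 8.0, -3.0, 20.1, 5.5],
    "country": ["United Kingdom", "UK", "UK", "uk", "U.K.", "France"],
    "plan": ["PLUS", "basic ", "basic ", "Premium", " plus", "Basic"],
})

# 1. Drop exact duplicate rows before anything else.
df = df.drop_duplicates().reset_index(drop=True)

# 2. Sentinel ages (999 and 0) are not real values: mark them as missing.
df["age"] = df["age"].mask(df["age"].isin([0, 999]))

# 3. Negative session lengths are impossible: mark them as missing too.
df["avg_session_minutes"] = df["avg_session_minutes"].where(
    df["avg_session_minutes"] >= 0
)

# 4. Collapse the country variants onto one canonical label.
uk_variants = {"united kingdom", "uk", "u.k."}
is_uk = df["country"].str.strip().str.lower().isin(uk_variants)
df["country"] = df["country"].mask(is_uk, "United Kingdom")

# 5. Normalise plan names: trim whitespace, then use consistent casing.
df["plan"] = df["plan"].str.strip().str.title()

print(df)
```

Deduplicating first matters: if the sentinel and label fixes ran earlier, rows that differ only by a typo would become new duplicates, and the count of planted duplicates would no longer match.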

The most valuable lessons are in the missing data, because the file contains three different kinds of missingness on purpose. Age is missing completely at random, the easy case. Email open rate is missing structurally, because it only exists when a member opted into marketing, so blindly imputing a mean would be nonsense. And the NPS score, missing about half the time, is missing more often for at-risk members, which is missing-not-at-random and carries signal in the very fact of being absent. Three columns, three mechanisms, three correct responses, all in one file.
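One possible response to each mechanism is sketched below. The column names (marketing_opt_in, email_open_rate, nps_score) and the specific choices, such as median imputation and a zero placeholder, are assumptions for illustration rather than the code-along's prescribed answers.

```python
import numpy as np
import pandas as pd

# Stand-in frame; column names are assumptions, not the real schema.
df = pd.DataFrame({
    "age": [34.0, np.nan, 52.0, 41.0],
    "marketing_opt_in": [True, False, True, False],
    "email_open_rate": [0.40, np.nan, 0.10, np.nan],
    "nps_score": [9.0, np.nan, np.nan, 7.0],
})

# Missing completely at random (age): a simple median fill is a
# defensible default because absence carries no information.
df["age"] = df["age"].fillna(df["age"].median())

# Structurally missing (email open rate): the value does not exist for
# members who never opted in. Keep the opt-in flag as a feature and use
# a placeholder instead of a mean that would invent engagement.
df["email_open_rate"] = df["email_open_rate"].where(df["marketing_opt_in"], 0.0)

# Missing not at random (NPS): record the absence itself as a feature
# before any imputation, since being missing is the signal.
df["nps_missing"] = df["nps_score"].isna().astype(int)

print(df)
```

The indicator column has to be created before the NPS score is filled in any way; once the gaps are imputed, the information that the member never answered is gone.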

See you soon.
