Why We Built a Bayesian Brain for Your Training Plan
Every fitness app says it “learns.” We wanted to prove it. Here’s why we chose Bayesian parameter estimation over neural nets, how six independent sub-models personalize your training, and why the system can never be worse than a textbook.
The Problem with “Smart” Training Apps
Open any fitness app that claims to adapt to you. Train for a month. Then ask yourself a simple question: what, specifically, has it learned?
You won’t get an answer. Not because the information is hidden — because it doesn’t exist. Most “adaptive” training apps are just rule engines wearing a lab coat. They have hardcoded thresholds masquerading as intelligence:
- HRV below X? Suggest a rest day.
- Completed all sets? Add 5 lbs next week.
- Missed two sessions? Send a motivational push notification.
That’s not learning. That’s a flowchart. And for a while, our system was a flowchart too.
What We Already Had (And Why It Wasn’t Enough)
Before Phase 5, we’d built something genuinely useful. A readiness engine that fused HRV, sleep, resting heart rate, and training load into a single daily score. Biometric gates that could flag when your body was sending distress signals. An adaptation engine that adjusted your workout in real time — swapping heavy squats for light recovery work when your readiness tanked. A nutrition layer that understood you shouldn’t be doing PR attempts in a caloric deficit. A subjective feedback system that let your body have a vote alongside your wearable.
It worked. Users liked it. But it had a fundamental limitation: it treated every human identically.
The system knew that on average a heavy session takes about 36 hours to recover from. It knew that in general a 10% volume jump week-over-week is safe. What it didn’t know was anything about you. Maybe you’re a fast recoverer. Maybe you handle high volume like it’s nothing but crumble under intensity. Maybe you report RPE 7 for what your heart rate says is an RPE 9.
We had a population-level model pretending to be personalized. The readiness score was real. The adaptation was real. But the parameters driving it were borrowed from textbooks and averages. We were running everyone through the same transfer function and hoping the inputs were different enough to produce different outputs.
That’s fine for month one. It’s not fine for month six.
The Obvious Wrong Answer
The obvious move is to throw a neural net at it. Collect a bunch of training logs, sleep data, and outcomes. Train a model. Deploy it. Blog about your “AI-powered coaching engine.”
We looked at this seriously for about two weeks before killing it. Here’s why:
You don’t have enough data. A dedicated user trains 4–5 times per week. That’s ~20 data points per month. A neural net needs thousands of examples to learn anything useful. By the time you have enough data to train a decent model, the user has changed — they’re fitter, or injured, or training for a different goal. The data that would have been useful six months ago is actively misleading now.
You can’t explain it. When a neural net says “rest today,” you can’t ask why. You can’t tell the user “your recovery rate is 15% faster than average — here’s the evidence.” You just get a number. Trust us. Users — especially the kind of users who track their HRV — don’t trust black boxes.
You can’t roll it back. When (not if) the model makes a bad recommendation, you need to undo the damage. With a neural net, you retrain. With the approach we chose, you revert a single parameter to its previous version and keep everything else.
You can’t test it. How do you write a unit test for “the neural net learned that this user recovers fast”? You can’t. You test the whole model or nothing. That’s not engineering, that’s vibes.
Bayes, Specifically
We needed a learning system that works with small data, explains itself, fails gracefully, and can be tested with pytest. So we went with the oldest trick in the book: Bayesian parameter estimation.
The idea is embarrassingly simple. For every parameter the system needs to personalize, we do three things:
1. Start with a prior. What does the average person look like? Sports science has answers for recovery rates, volume tolerance, and more. We start there.
2. Observe. Every time we make a prediction and see the outcome, we record both. You trained heavy on Monday, and your HRV returned to baseline in 28 hours? That’s an observation.
3. Update. Bayes’ rule combines the prior with the observation to produce a posterior — a new, slightly more personalized estimate. Do this 15 times and your recovery model has meaningfully departed from the population average. Do it 50 times and it’s your model.
The math is a Normal-Normal conjugate update. It has a closed-form solution. No gradient descent, no training loop, no GPU. It’s arithmetic that runs inline on every API request — a weighted average of the prior and your observations, where the weights are determined by how much data you have and how noisy that data is.
With zero observations, the posterior equals the prior. The system recommends exactly what a sports science textbook would. With five observations, the posterior starts drifting toward your data — but the prior still has a strong vote, because five data points isn’t much. With fifty observations, the prior is nearly irrelevant. The model is yours.
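The update is small enough to show in full. Here is a minimal sketch in Python, assuming known observation noise; the function name and all the numbers (a 36-hour textbook prior, a user who keeps clocking 28 hours) are illustrative, not our production values:

```python
def posterior(prior_mean, prior_var, observations, obs_var):
    """Normal-Normal conjugate update with known observation noise.

    The posterior mean is a precision-weighted average of the prior
    and the sample mean -- closed form, no training loop, no GPU.
    """
    n = len(observations)
    if n == 0:
        return prior_mean, prior_var  # zero observations: posterior == prior
    sample_mean = sum(observations) / n
    prior_precision = 1.0 / prior_var
    data_precision = n / obs_var
    post_var = 1.0 / (prior_precision + data_precision)
    post_mean = post_var * (prior_precision * prior_mean
                            + data_precision * sample_mean)
    return post_mean, post_var

# Textbook prior: ~36 h recovery. A user who repeatedly recovers in
# ~28 h pulls the estimate toward their own data as evidence accumulates.
mean_5, _ = posterior(36.0, 16.0, [28.0] * 5, 25.0)    # prior still has a vote
mean_50, _ = posterior(36.0, 16.0, [28.0] * 50, 25.0)  # nearly all yours
```

With five observations the estimate lands between 36 and 28; with fifty it sits almost exactly on 28. The weights fall out of the precisions: more data, or cleaner data, means the observations count for more.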
This means the system is never worse than a textbook. It starts at “evidence-based generic” and converges toward “evidence-based personal.” There’s no cold start problem in the traditional sense — the cold start is the textbook, and textbooks are pretty good.
We Didn’t Invent This
We want to be clear about something: there is nothing novel here. This exact pattern — population prior, per-user Bayesian posterior, confidence-gated automation — is battle-tested across some of the most successful adaptive systems in production today.
Duolingo uses Bayesian forgetting curves to model how quickly you forget vocabulary. They start with a population-level decay rate and update it per-user as you get flashcards right or wrong. Same math. Same conjugate update. They’ve published papers on it.
Spotify and Netflix use Thompson sampling — a Bayesian bandit algorithm — to balance exploration and exploitation in recommendations. Should we show you something we know you like, or something new that might expand your taste? They maintain per-user posterior distributions over preference parameters to make that call.
Clinical decision support systems use Bayesian updating to adapt population-level treatment guidelines to individual patients. When you only have a handful of lab results for one person, you can’t train a neural net. But you can update a prior. Hospitals do this every day.
Adaptive testing platforms like the GRE and GMAT use Item Response Theory — which is, under the hood, Bayesian estimation of a latent ability parameter. After each question you answer, the system updates its belief about your skill level and picks the next question accordingly. By question 20, it has a tight posterior on your ability without needing a 200-question exam.
We’re not doing anything these systems haven’t proven at scale. We’re just pointing the same math at your HRV data instead of your vocabulary retention or your movie preferences.
Six Models, Not One
One of the earliest decisions we made was to resist the temptation of a single unified model. Instead, we built six independent sub-models, each responsible for learning one thing:
| Model | What It Learns |
|---|---|
| Recovery Curve | How fast you bounce back from sessions |
| Volume Tolerance | How much training volume you can handle per muscle group |
| RPE Calibration | Whether your perceived effort matches your actual output |
| Readiness Prediction | What your readiness will be tomorrow given today’s signals |
| Schedule Preference | When you actually show up vs. when you say you will |
| Nutrition Impact | How your eating patterns affect your recovery |
Some models personalize within weeks, others take months — it depends on how much signal they need. Each model maintains its own state independently. You can test the recovery curve model without touching the volume model. You can reset the nutrition model after a user changes their diet without throwing away four months of recovery data.
This matters more than it sounds like it matters. In production, things go wrong one at a time. A user gets a new sleep tracker that reports different HRV values. That might blow up the readiness prediction model. In a monolithic system, that corruption spreads everywhere. In ours, one model gets reset to its prior and re-learns in a month. The other five are untouched.
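That isolation is easy to picture as data. A hypothetical sketch (the class, field names, and prior values are invented for illustration) of how independent per-model state makes a single reset possible:

```python
from dataclasses import dataclass, field

@dataclass
class SubModel:
    """One independently-learning sub-model. Field names are hypothetical."""
    prior_mean: float
    prior_var: float
    observations: list = field(default_factory=list)

    def reset(self):
        # Back to the textbook prior; the other five models are untouched.
        self.observations.clear()

models = {
    "recovery_curve": SubModel(36.0, 16.0),
    "readiness_prediction": SubModel(70.0, 100.0),
    # ...volume tolerance, RPE calibration, schedule, nutrition
}

# A new sleep tracker corrupts readiness data: reset only that model.
models["readiness_prediction"].observations += [64.0, 71.0, 58.0]
models["readiness_prediction"].reset()
```

Each model owns its prior and its evidence, so a reset is one `clear()`, not a retrain of everything.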
The Confidence Gate
Here’s the thing about Bayesian models that makes them uniquely suited for this: they know what they don’t know.
The posterior distribution isn’t just a point estimate — it has a width. A wide posterior means “I don’t have enough data to be sure.” A narrow one means “I’m fairly certain about this.” We collapse that width into a confidence score between 0 and 1, and we use it as a gate.
If a model’s confidence is above a threshold, its predictions are used to drive the adaptation engine. Below that, the system falls back to the rule-based defaults from Phases 1–4.
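One way to sketch that gate, assuming a confidence score derived from how much the posterior has narrowed relative to the prior. Both the mapping and the threshold here are illustrative, not our production values:

```python
import math

CONFIDENCE_THRESHOLD = 0.7  # illustrative cutoff

def confidence(post_var, prior_var):
    """Collapse posterior width into a score in [0, 1].

    0 when no data has narrowed the posterior (post_var == prior_var),
    approaching 1 as the posterior tightens. One reasonable mapping
    among several.
    """
    return 1.0 - math.sqrt(post_var / prior_var)

def gated(bayes_pred, rule_based_pred, post_var, prior_var):
    # Below the threshold, the battle-tested rule engine stays in control.
    if confidence(post_var, prior_var) >= CONFIDENCE_THRESHOLD:
        return bayes_pred
    return rule_based_pred
```

On day one the posterior equals the prior, confidence is zero, and the rule-based value wins by construction; the Bayesian prediction only drives the adaptation engine once the data has earned it.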
This is the key safety property: Phase 5 can never make the system worse than Phase 4. The learning layer proposes. The confidence gate decides whether anyone should listen. When a model is new and uncertain, the battle-tested rule engine stays in control. As data accumulates and confidence rises, the Bayesian model gradually takes the wheel.
No flag day. No moment where we flip from “rule-based” to “AI.” Just a smooth, user-by-user, parameter-by-parameter handoff as confidence accrues.
We Can Prove It Works (Or Prove It Doesn’t)
This is the part we’re proudest of, and the part most “adaptive” systems skip entirely.
Every prediction the system makes is logged: what we predicted, how confident we were, and when we expect to know if we were right. The next day (or the next week, depending on the prediction horizon), we check. Was HRV actually back to baseline in 28 hours? Did the user actually feel recovered? Did their readiness actually improve after the deload?
Then we compare our accuracy against baselines:
- Naive persistence: “Tomorrow’s readiness equals today’s.” This is the laziest possible model, and it’s surprisingly hard to beat.
- Population mean: “Predict the average.” This is what you get from a non-personalized system.
If our Bayesian model can’t beat “tomorrow equals today,” it’s turned off. That’s it. No excuses, no “it’ll get better with more data” hand-waving. Beat the baseline or defer to the rule engine.
After a month with a typical user, both the readiness prediction and recovery curve models meaningfully beat their baselines. That might not sound exciting, but it’s the difference between catching a bad readiness day in advance versus reacting to it after the user has already dragged themselves through a workout they shouldn’t have done. Recovery predictions improve fastest — because recovery rate varies more between individuals than day-to-day readiness does.
We show these numbers to users. Right in the app. Per-model accuracy compared to the generic baseline. Transparency builds trust, and trust matters when you’re telling someone to skip a workout.
The Safety Envelope
We don’t trust our models. That’s a feature.
Sitting outside the entire learning system is a safety envelope — an immutable set of hard bounds that no model output can violate, regardless of what it learns:
- Weekly volume increases are capped, even if the model thinks you can handle more.
- You get mandatory rest days, even if your readiness is perfect.
- After sustained loading without a break, you get a deload. Period. Connective tissue doesn’t send HRV signals.
- If your resting heart rate is significantly elevated above baseline, high-intensity work is blocked. The model doesn’t get a vote.
This is the same pattern used in autonomous vehicles and clinical decision support: the learning layer operates inside a sandbox. The sandbox has walls. The walls don’t move.
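In code, the envelope is just a clamp that runs after everything else. A sketch with invented bounds; the real numbers come from sports-science guidelines, not from the model:

```python
MAX_WEEKLY_VOLUME_GROWTH = 0.10  # hypothetical cap on week-over-week volume
MAX_RHR_RATIO = 1.10             # hypothetical: RHR above 110% of baseline

def apply_safety_envelope(session, last_week_volume, resting_hr, baseline_hr):
    """Hard bounds applied to every model output, regardless of confidence."""
    clamped = dict(session)
    ceiling = last_week_volume * (1 + MAX_WEEKLY_VOLUME_GROWTH)
    # Cap volume growth even if the model thinks you can handle more.
    clamped["volume"] = min(clamped["volume"], ceiling)
    if resting_hr > baseline_hr * MAX_RHR_RATIO:
        clamped["intensity"] = "low"  # the model doesn't get a vote
    return clamped
```

The learning layer never sees this function; it proposes, the envelope disposes.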
It sounds conservative, and it is. We’d rather leave 5% of performance on the table than send a user into an overtraining spiral because a model got overconfident. Fitness isn’t a game with a reset button.
Exploration: Learning What You Don’t Prescribe
There’s a subtle problem with learning from your own recommendations. If the system always deloads when readiness drops below 60, it never discovers that this particular user performs fine at 55. It only sees the world it creates.
So occasionally, the system deliberately deviates slightly from its “optimal” prescription. A little more volume here. A little less intensity there. Always within tight bounds, never in the red zone, never back-to-back with another exploration session.
Then it watches what happens. Did the extra volume cause a readiness crash? Or did the user absorb it like nothing happened? That observation goes into the model, and now the system knows something it couldn’t have learned by always playing it safe.
This is a contextual bandit — the same explore/exploit pattern Spotify uses to discover whether you’d like a song you’ve never heard. You don’t need a thousand exploration trials. You need a few dozen well-chosen ones, spread over months, within a safety envelope. Each one incrementally expands the system’s map of your capabilities.
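The deviation rule can be sketched in a few lines. This is a simplified epsilon-style stand-in, not full Thompson sampling, and every number in it (the 10% exploration rate, the 5% tweak bound, the red-zone threshold of 50) is invented for illustration:

```python
import random

def maybe_explore(optimal_volume, readiness, last_was_exploration,
                  explore_prob=0.1, max_tweak=0.05, rng=random):
    """Occasionally deviate a little from the prescribed volume.

    Explore with small probability, within tight bounds, never in the
    red zone, never back-to-back with another exploration session.
    Returns (prescribed_volume, was_exploration).
    """
    if readiness < 50 or last_was_exploration:
        return optimal_volume, False
    if rng.random() < explore_prob:
        tweak = rng.uniform(-max_tweak, max_tweak)
        return optimal_volume * (1 + tweak), True
    return optimal_volume, False
```

Whatever happens after an exploration session is recorded like any other observation, so the map of the user's tolerance grows a few well-chosen data points at a time.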
What Users See
We thought hard about how to surface all of this without overwhelming anyone. The answer: show confidence, show baselines, show what the system has learned in plain language.
On the training tab, there’s an “Intelligence” section. It shows:
- Per-model accuracy compared to baselines (“Your recovery predictions: 78% accurate vs. 61% generic”)
- Confidence bars (so you know which models are mature and which are still learning)
- Plain-language insights: “You recover about 15% faster than average” or “Low sleep hits your training harder than most people”
- Compliance stats and mesocycle position
No charts of posterior distributions. No Bayesian jargon. Just: here’s what your system knows, here’s how sure it is, and here’s the evidence.
If the system hasn’t learned enough to say anything useful — which is the case for the first couple weeks — it says so. “Still learning. Using evidence-based defaults.” Honesty over theater.
The Bet We’re Making
We’re betting that a simple model you can explain, test, and trust is worth more than a complex model you can’t. We’re betting that users who track biometrics want to understand their body, not just follow instructions from an oracle. We’re betting that straightforward Bayesian math, applied patiently over months, with safety rails and honest accuracy reporting, will produce better outcomes than any black-box model trained on someone else’s data.
Laplace figured out the math in 1774. Duolingo proved it works for learning. Spotify proved it works for discovery. The GRE proved it works for assessment. We’re proving it works for training.
Related reading
- “Your Calorie Target Is Wrong. Here’s How We Fix It.” Most calorie calculators give you a number based on a formula from the 1990s and call it a day. Omnio watches what you actually eat, how your body actually responds, and learns your real metabolic fingerprint over time.
- “Most Nutrition Trackers Count Calories. Ours Understands Your Diet.” Calorie counting is table stakes. Omnio validates your logs against government nutrition databases, scores meal quality, tracks 35 micronutrients and polyphenols, classifies your dietary pattern, and connects what you eat to how you sleep, recover, and train.
- “Adaptive Training With Wearable Readiness Data.” Omnio doesn’t just track your workouts — it derives your training schedule, gates every session through biometric readiness, adjusts for nutrition, and learns from outcomes. Here’s how we’re building a genuinely adaptive training system.