Readiness Score Confidence: When the Number Should Be Suppressed

A readiness score with no confidence value is an opinion, not a measurement. How to read credible intervals, spot low-confidence days, and know when to suppress.

Mac DeCourcy

Your dashboard says readiness 72. A green circle. A “good day to train” recommendation. What the dashboard doesn’t say is that your primary HRV source didn’t sync overnight, the backup source is running a 48-hour-old baseline, and the acute training load model is extrapolating from a partial sync of last week’s sessions. The 72 is plausible. It is not measured. It is a guess with a friendly UI.

This is what confidence values are for. Not as a nice-to-have UX detail, but as the part of the score that tells you whether to act on it. The pillar on composite scores with confidence frames the broader problem. This post is a focused look at one piece of the solution: credible intervals on the readiness number, how to read them, and when the number should be suppressed entirely.


What a “Readiness 72” Actually Hides

Every composite readiness score is a weighted combination of physiological inputs — HRV, resting heart rate, sleep quality, acute training load, sometimes body temperature, respiration rate, or activity recovery — reduced to a single 0 to 100 value. The reduction is the point. The problem is that the reduction is one-way.

When a platform writes “72” to the screen, it has discarded three kinds of information:

  1. Which inputs contributed. A 72 built mostly from HRV is a different claim than a 72 built mostly from sleep duration. The user can’t tell them apart.
  2. How fresh each input was. An HRV reading from 4am today is not the same as an HRV reading pulled from a cached baseline five days old. The reduction treats them identically.
  3. How confident the model is in the number. A 72 computed from clean, complete, cross-validated data is a strong claim. A 72 computed from a partial fallback chain with one artifact-heavy night is a weak one. Same number, very different truth.

A confidence value is the reconstruction of the third piece. It is the answer to “how much should I trust this?” — and it belongs on the dashboard next to the score, not buried in a settings page or a chat transcript.

Point Estimates vs Distributions

The cleanest way to think about readiness is as a distribution, not a number. Your “true” readiness on any given day is a physiological state that no wearable measures directly. What the model produces is a best guess at where that state sits, plus an honest admission of how uncertain the guess is.

A point estimate is the single most likely value. A credible interval is the range of plausible values. Both come from the same underlying model; the platform just chooses which to surface.

Consider two days with the same point estimate:

  • Day A: readiness 72, 95 percent credible interval 68 to 76. Tight interval. The model is saying “I am quite sure you are around 72. The inputs agree, the baseline is stable, and no fallbacks fired.”
  • Day B: readiness 72, 95 percent credible interval 54 to 88. Wide interval. Same central value, but the model is saying “the number is somewhere between ‘rest day’ and ‘push hard.’ I don’t actually know.”

The point estimates are identical. The right decisions are not. Day A supports training. Day B supports waiting for better data or treating the number as a rough prior until the next sync.

Most consumer scoring surfaces show only the point estimate. That’s like reporting a weather forecast as “68 degrees” without mentioning whether the forecast error is 2 degrees or 18 degrees. For any decision more consequential than “should I wear a jacket,” the error band matters as much as the mean.
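To make the Day A and Day B contrast concrete, here is a minimal sketch that reads an equal-tailed 95 percent credible interval off a set of posterior samples. The Gaussian posteriors, their spreads, and the `credible_interval` helper are assumptions made for illustration; a real scoring model would produce its own samples and would likely clamp them to the 0 to 100 scale.

```python
import random
import statistics


def credible_interval(samples, mass=0.95):
    """Equal-tailed credible interval read off sorted posterior samples."""
    ordered = sorted(samples)
    tail = (1.0 - mass) / 2.0
    lo_idx = int(tail * (len(ordered) - 1))
    hi_idx = int(round((1.0 - tail) * (len(ordered) - 1)))
    return ordered[lo_idx], ordered[hi_idx]


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical posteriors: same centre, very different spread.
day_a = [rng.gauss(72, 2.0) for _ in range(10_000)]
day_b = [rng.gauss(72, 8.7) for _ in range(10_000)]

for name, samples in (("Day A", day_a), ("Day B", day_b)):
    low, high = credible_interval(samples)
    centre = statistics.median(samples)
    print(f"{name}: readiness {centre:.0f}, 95% interval {low:.0f} to {high:.0f}")
```

Both days print roughly the same central value. Only the interval tells them apart, which is the information a point-estimate-only dashboard throws away.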

Where Confidence Comes From

Confidence isn’t magic. It’s the aggregation of a handful of well-defined signals, each of which can degrade the model’s certainty:

Freshness. If an input hasn’t been refreshed within its expected cadence — overnight HRV within the last 24 hours, acute training load within the last 48, sleep within the last 18 — the most recent value is an extrapolation. Extrapolations widen intervals.
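A freshness check along these lines can be sketched in a few lines. The cadences are the ones quoted above; the input names, the `stale_inputs` helper, and the example timestamps are hypothetical.

```python
from datetime import datetime, timedelta

# Expected refresh cadences from the text, in hours.
MAX_AGE_HOURS = {"hrv": 24, "training_load": 48, "sleep": 18}


def stale_inputs(last_updated, now):
    """Return {input name: hours past its expected cadence} for stale inputs."""
    overdue = {}
    for name, limit in MAX_AGE_HOURS.items():
        timestamp = last_updated.get(name)
        if timestamp is None:
            overdue[name] = float("inf")  # never synced at all
            continue
        age_hours = (now - timestamp).total_seconds() / 3600
        if age_hours > limit:
            overdue[name] = age_hours - limit
    return overdue


now = datetime(2024, 1, 10, 7, 0)
last_updated = {
    "hrv": now - timedelta(hours=42),            # missed last night's sync
    "training_load": now - timedelta(hours=30),  # still within 48 hours
    "sleep": now - timedelta(hours=2),
}
overdue = stale_inputs(last_updated, now)
print(overdue)  # {'hrv': 18.0}
```

How much each overdue hour should widen the interval is a modelling decision this sketch leaves open.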

Baseline history length. Personalized thresholds like p25/p50/p75 need enough data to be stable. A 7-day baseline is noisy. A 30-day baseline is reasonable. A 90-day baseline lets the model separate your real variability from sampling noise. Short histories widen intervals.
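As a rough illustration, the personalized quartiles can be computed with the standard library and tagged with how much history backs them. The 30-day and 90-day cut-offs follow the text; treating everything in between as a single tier, the `baseline_quartiles` helper, and the sample HRV values are assumptions.

```python
import statistics


def baseline_quartiles(history):
    """p25/p50/p75 of a personal baseline, tagged with a stability label."""
    if len(history) < 2:
        raise ValueError("need at least two days of history")
    p25, p50, p75 = statistics.quantiles(history, n=4, method="inclusive")
    days = len(history)
    if days < 30:
        stability = "noisy"
    elif days < 90:
        stability = "reasonable"
    else:
        stability = "stable"
    return {"p25": p25, "p50": p50, "p75": p75, "days": days, "stability": stability}


# Hypothetical overnight HRV values (ms) for a user's first week.
first_week = [48, 55, 51, 62, 44, 58, 53]
print(baseline_quartiles(first_week))
```

A downstream model could map the stability label to a wider or narrower interval; that mapping is not shown here.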

Artifact rate. Every overnight HRV window comes with a fraction of beats that were filtered as sensor noise, motion, or ectopics. A clean night might have 2 percent rejected. A rough night might have 40 percent rejected. The derived HRV from a 40-percent-rejected window is structurally less reliable — a detail that rarely surfaces in consumer UIs and always should.

Cross-source agreement. When two sources measure the same quantity and disagree by more than their expected tolerance, at least one of them is wrong. The model doesn’t know which, so it widens the interval. Cross-source divergence handling covers this topic in full.
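A minimal agreement check might look like the following. The tolerance value, the readings, and the `cross_source_check` helper are hypothetical; real tolerances would come from per-device accuracy data.

```python
def cross_source_check(value_a, value_b, tolerance):
    """Compare two readings of the same quantity against an expected tolerance."""
    if tolerance <= 0:
        raise ValueError("tolerance must be positive")
    gap = abs(value_a - value_b)
    return {
        "agree": gap <= tolerance,
        "gap": gap,
        # How far past tolerance the gap is, as a fraction of tolerance (0 if within).
        "excess_ratio": max(0.0, gap / tolerance - 1.0),
    }


# Hypothetical overnight HRV (ms) from a ring and a watch, 10 ms tolerance.
check = cross_source_check(58, 41, tolerance=10)
print(check["agree"], check["gap"])  # False 17
```

The check only reports that the sources diverge. It deliberately does not pick a winner, since the model has no basis for knowing which source is wrong.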

Biological plausibility. If tonight’s combination of inputs is unusual — unusually low RHR with unusually high HRV and unusually short sleep, say — the model treats it as possible sensor error rather than a physiological tour de force. Implausibility widens intervals.

A Bayesian-style approach formalizes this. Each input is modeled as a distribution over plausible values, the priors encode typical uncertainty, and the composite inherits the variance. Simpler weighted-confidence approaches work too: each input gets a reliability score between 0 and 1, weighted by its contribution, and the aggregate drives a final confidence value. The specific math doesn’t matter for users. What matters is that the interval reflects the reliability of the inputs today, not a static uncertainty baked into the model.
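The simpler weighted-confidence variant can be sketched as a weighted mean of per-input reliabilities. The weights and reliability values below are invented for illustration, and `InputSignal` and `aggregate_confidence` are hypothetical names, not any platform's API.

```python
from dataclasses import dataclass


@dataclass
class InputSignal:
    name: str
    weight: float       # contribution to the composite score
    reliability: float  # 0 (unusable) to 1 (clean, fresh, validated)


def aggregate_confidence(inputs):
    """Weighted mean of per-input reliability, in the range 0 to 1."""
    total_weight = sum(signal.weight for signal in inputs)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(s.weight * s.reliability for s in inputs) / total_weight


# Hypothetical morning: HRV is stale, everything else is in decent shape.
inputs = [
    InputSignal("hrv", weight=0.4, reliability=0.3),
    InputSignal("sleep", weight=0.3, reliability=0.9),
    InputSignal("resting_hr", weight=0.2, reliability=0.8),
    InputSignal("training_load", weight=0.1, reliability=0.5),
]
confidence = aggregate_confidence(inputs)
print(f"aggregate confidence: {confidence:.2f}")
```

Because HRV carries the largest weight, its low reliability pulls the aggregate down further than a stale low-weight input would.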

When to Suppress the Score

Once you have a credible interval, you have a clear answer to a question most platforms dodge: should this number be shown at all?

Suppression — refusing to display a score below some confidence threshold — is the most honest interaction a health dashboard can offer. A low-confidence 72 is not more useful than no number. It is worse, because users read the rendered digit as an assertion. The credible interval might say 50 to 90, but the UI fed them a single definitive-looking value.

The right default is: if the interval is wider than a threshold, show “insufficient data” plus an explanation, not a number.

Thresholds that work in practice:

  • Interval width above 25 points on a 0 to 100 scale. Wider than that, the number is less precise than “low / medium / high” categories would be, so rendering a digit adds false precision.
  • Aggregate input confidence below 40 percent. Below this, at least one high-weight input is missing, stale, or artifact-heavy.
  • Fewer than 14 days of baseline history. Personalized thresholds aren’t stable yet. Use a loose population prior and say so.
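The three thresholds above translate directly into a suppression check. This is a sketch under the stated thresholds; the `ReadinessEstimate` type, the function names, and the message wording are hypothetical, and the short-baseline case is treated here as suppression rather than a population-prior fallback.

```python
from dataclasses import dataclass

MAX_INTERVAL_WIDTH = 25.0  # points on the 0 to 100 scale
MIN_CONFIDENCE = 0.40      # aggregate input confidence
MIN_BASELINE_DAYS = 14


@dataclass
class ReadinessEstimate:
    point: float
    low: float
    high: float
    confidence: float
    baseline_days: int


def suppression_reasons(est):
    """Return every threshold the estimate fails; empty means show the score."""
    reasons = []
    if est.high - est.low > MAX_INTERVAL_WIDTH:
        reasons.append(f"interval is {est.high - est.low:.0f} points wide")
    if est.confidence < MIN_CONFIDENCE:
        reasons.append(f"input confidence is {est.confidence:.0%}")
    if est.baseline_days < MIN_BASELINE_DAYS:
        reasons.append(f"only {est.baseline_days} days of baseline history")
    return reasons


def render(est):
    reasons = suppression_reasons(est)
    if reasons:
        return "Insufficient data: " + "; ".join(reasons)
    return f"Readiness {est.point:.0f} (range {est.low:.0f} to {est.high:.0f})"


day_a = ReadinessEstimate(72, 68, 76, confidence=0.85, baseline_days=60)
day_b = ReadinessEstimate(72, 54, 88, confidence=0.35, baseline_days=60)
print(render(day_a))
print(render(day_b))
```

Returning all failed thresholds, rather than stopping at the first, gives the UI enough material to explain the suppression.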

The suppression frequency target is roughly 5 to 15 percent of days for an active user with normal wearable coverage. Below 5 percent, the thresholds are probably too permissive and low-confidence numbers are still being shown. Above 15 percent, the feature feels broken and users disable it. The right frequency depends on the user’s actual data quality: someone who travels and lets their ring die four nights a month should see more suppressed days than someone with continuous wear.
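Checking a user's history against that target band is a short calculation. The band edges come from the text; the helper names and the 30-day example are assumptions, and the band is a rough guide rather than a hard rule.

```python
def suppression_rate(suppressed_flags):
    """Fraction of days on which the score was suppressed."""
    if not suppressed_flags:
        raise ValueError("need at least one day of history")
    return sum(suppressed_flags) / len(suppressed_flags)


def assess_rate(rate, low=0.05, high=0.15):
    """Compare a suppression rate with the rough 5 to 15 percent target band."""
    if rate < low:
        return "thresholds may be too permissive"
    if rate > high:
        return "users may read the feature as broken"
    return "within target band"


# Hypothetical month: 3 suppressed days out of 30.
flags = [False] * 27 + [True] * 3
rate = suppression_rate(flags)
print(f"{rate:.0%}: {assess_rate(rate)}")  # 10%: within target band
```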

Suppression should always explain itself. “We couldn’t compute readiness today because your primary HRV source didn’t sync and your backup is running a stale baseline. Try again in 2 hours or sync manually.” That’s a UX pattern that respects the user’s intelligence and doesn’t train them to ignore warnings.

The Anatomy of a Bad Day

Consider a specific failure mode: the user lands on a red-eye flight, their ring died somewhere over the ocean, they slept four hours in a hotel bed they reached at 3am, and their training schedule says deadlifts.

A naive scoring system takes whatever data trickled in, runs it through the normal pipeline, and produces a number. Maybe it’s low — a 45 — which the user interprets as “my body is telling me not to train.” Maybe it’s high — a 78 — because the four hours they did sleep were measured as high-HRV deep sleep and the truncated night got normalized away. Either number is being treated as a measurement of their physiological state when it’s really a measurement of how well the pipeline handled a degraded input set.

A confidence-aware system on the same day:

  • Freshness: primary source stale by 18 hours. Degraded.
  • Baseline: 30-day baseline holds, but the last 2 days are missing. Slightly degraded.
  • Artifact rate: unknown for tonight because the primary source didn’t capture. Fallback source caught the night but has higher typical artifact rate. Degraded.
  • Cross-validation: only one source present. Cannot validate.
  • Biological plausibility: sleep duration 4 hours is 2.5 standard deviations below baseline. Unusual, but not implausible given travel context. Borderline.

Aggregate confidence: low. The right UX is “readiness suppressed — 4 hours of sleep after a red-eye plus a single-source night means we can’t compute a trustworthy number. Recommend treating today as a restoration day regardless of what the data would have said.”
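The checklist above can be turned into an explanation with a small amount of code. This is a sketch only: the status levels, the mapping of "slightly degraded" to borderline, and the rule "suppress if any check is unavailable or at least two are degraded" are assumptions, not a documented policy.

```python
from enum import Enum


class Status(Enum):
    OK = 0
    BORDERLINE = 1
    DEGRADED = 2
    UNAVAILABLE = 3


# The red-eye day from the text, one entry per confidence signal.
checks = {
    "freshness": (Status.DEGRADED, "primary source stale by 18 hours"),
    "baseline": (Status.BORDERLINE, "last 2 days missing from the 30-day baseline"),
    "artifact rate": (Status.DEGRADED, "fallback source has a higher typical artifact rate"),
    "cross-validation": (Status.UNAVAILABLE, "only one source present"),
    "plausibility": (Status.BORDERLINE, "4 hours of sleep is 2.5 SD below baseline"),
}


def explain(checks, max_reasons=2):
    """Return a suppression message led by the worst checks, or None to show the score."""
    flagged = sorted(
        ((status, detail) for status, detail in checks.values() if status is not Status.OK),
        key=lambda item: item[0].value,
        reverse=True,  # worst first; ties keep their original order
    )
    unavailable = sum(1 for status, _ in flagged if status is Status.UNAVAILABLE)
    degraded = sum(1 for status, _ in flagged if status is Status.DEGRADED)
    if unavailable == 0 and degraded < 2:
        return None
    reasons = "; ".join(detail for _, detail in flagged[:max_reasons])
    return f"Readiness suppressed: {reasons}."


message = explain(checks)
print(message)
```

Capping the message at the two worst reasons keeps it readable. The full list can sit behind a tap.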

That message tells the user something useful. It acknowledges the physiological reality (sleep debt, travel) without pretending a number can summarize it. And it protects them from the failure mode where the pipeline produces a misleadingly high or low value that they act on.

Reading a Confidence Graph Over Time

A single day’s confidence value is useful. A confidence trace over weeks is more useful, because it surfaces systematic problems in the data pipeline that individual days miss.

Patterns that show up in long-horizon confidence:

Weekly dip on the same day. If confidence drops every Sunday, that’s probably a sensor hygiene pattern — the device not being charged overnight after weekend activities. Fix: add a sync reminder or a second source for Sunday night.

Slow degradation over weeks. Gradually widening intervals usually mean a primary source has started failing intermittently — a ring with worn-out contacts, a watch with a dying battery, a strap that’s losing tightness. The point estimates might still look reasonable for a while, but the intervals widen, and the confidence trace catches what the score-only view would miss until a catastrophic failure.

Persistently wide intervals. If the interval never narrows below some floor, the user probably doesn’t have enough data sources for the model to be confident. This is a normal state for users with one device. The model should still produce usable numbers — it just needs to be honest that the uncertainty is structural, not transient.

Seeing the trace helps the user calibrate their own trust in the score. A week of high-confidence 70s followed by a low-confidence 42 should be treated very differently from a week of mid-confidence 55 to 65 followed by another mid-confidence 60. The trajectory matters, and it’s invisible without confidence rendering.
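The weekly-dip pattern is the easiest of these to detect automatically. The sketch below groups a confidence trace by weekday and flags days that sit well below the overall mean; the 0.15 drop threshold, the `weekday_dips` helper, and the synthetic four-week trace are assumptions.

```python
from collections import defaultdict
from datetime import date, timedelta
from statistics import mean

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]


def weekday_dips(trace, drop=0.15):
    """Return weekdays whose mean confidence is at least `drop` below the overall mean."""
    by_weekday = defaultdict(list)
    for day, confidence in trace:
        by_weekday[WEEKDAYS[day.weekday()]].append(confidence)
    overall = mean(confidence for _, confidence in trace)
    return {
        name: round(mean(values), 2)
        for name, values in by_weekday.items()
        if overall - mean(values) >= drop
    }


# Synthetic four-week trace: confidence 0.8 every day except Sundays at 0.4.
start = date(2024, 1, 1)  # a Monday
trace = []
for offset in range(28):
    day = start + timedelta(days=offset)
    trace.append((day, 0.4 if day.weekday() == 6 else 0.8))

print(weekday_dips(trace))  # {'Sunday': 0.4}
```

Slow degradation and persistently wide intervals need different tests, such as a trend fit or a floor check, which this sketch does not attempt.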

Designing the UI Around Uncertainty

A dashboard built for confidence doesn’t just add a small confidence badge next to the number. It treats the number itself differently based on the interval.

Practical patterns:

  • Render the interval, not just the point. “Readiness 72 (range 64 to 80)” or a visual bar showing the plausible range with the central value marked. The bar’s width is the confidence.
  • Color the number by confidence. A saturated color for narrow intervals. A desaturated, slightly faded rendering for wide ones. The user learns to read the visual weight as trust weight.
  • Show the input breakdown on tap. Each contributing input with its own freshness, artifact rate, and weight. The aggregate confidence becomes legible rather than opaque.
  • Surface the dominant uncertainty driver. “Confidence low today because HRV didn’t sync” is more useful than “confidence 38 percent.” The user gets a concrete fix.
  • Log suppressions visibly. The history shouldn’t just show days with numbers. It should show days where the number was suppressed and why, so the user can see the platform is behaving consistently.
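As a text-only stand-in for the visual bar described in the first pattern, the interval can be drawn on a 0 to 100 track with the point estimate marked. The `interval_bar` helper and its characters are purely illustrative; a real dashboard would draw this graphically.

```python
def interval_bar(point, low, high, width=50):
    """Draw a 0 to 100 track: '=' spans the interval, '|' marks the point estimate."""
    def column(value):
        return min(width - 1, max(0, round(value / 100 * (width - 1))))

    cells = ["."] * width
    for index in range(column(low), column(high) + 1):
        cells[index] = "="
    cells[column(point)] = "|"
    return "".join(cells)


narrow = interval_bar(72, 68, 76)
wide = interval_bar(72, 54, 88)
print(narrow, " Readiness 72 (range 68 to 76)")
print(wide, " Readiness 72 (range 54 to 88)")
```

The two bars share the same marker position and differ only in how much of the track they cover, which is the visual cue that the bar's width is the confidence.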

Designing around uncertainty costs visual density — it’s easier to make a dashboard feel definitive than to make it feel honest. The tradeoff is that a definitive-feeling dashboard accumulates small trust breaches every time the user looks at a number that turned out to be nonsense. An honest dashboard builds trust by being boring when it doesn’t know — and right when it does.

Where This Fits in the Rest of the System

Confidence on readiness is the user-visible tip of a larger pipeline. For the interval to be correct, the inputs have to carry their own uncertainty: artifact detection has to propagate rejection rates, source fallback has to degrade confidence when the primary is missing, cross-validation has to flag divergence. If any one of those is missing, the interval is cosmetic.

Source fallback chains handle the case where one input is late. Cross-source validation handles the case where two inputs disagree. Artifact cleanup handles the case where the raw signal is corrupted. Personalized thresholds handle the case where the baseline itself is wrong. The readiness credible interval is where they all combine.

If you’re building this from scratch, the order of operations matters. Confidence can’t be bolted on at the end; it has to be present in every input path. A platform that adds a “confidence 80 percent” badge without having done the upstream work is lying politely. The number on the badge is less trustworthy than the readiness score it’s trying to qualify.

For more context on how this fits with the rest of Omnio’s scoring approach, see the pillar on composite scores with confidence and the feature page on composite health scores. The primer “What is a composite health score” covers the basics. “Which wearable is most accurate” looks at the per-device accuracy literature that feeds into realistic per-source confidence priors. And if you want to understand how this interacts with training decisions rather than just scoring, see the adaptive training intelligence guide.


Putting It Together

The short version:

  • A readiness score is a point estimate pulled from a distribution. The interval around it matters as much as the number itself.
  • Confidence is the aggregation of input freshness, baseline history, artifact rate, cross-source agreement, and biological plausibility. Each degrades the interval in predictable ways.
  • When the interval is too wide, the right UX is suppression plus explanation — not a rendered digit that looks more certain than the data supports.
  • Confidence over time is a diagnostic signal in its own right. Patterns in the confidence trace surface data-pipeline problems that day-to-day score views miss.
  • Designing around uncertainty means rendering the interval, coloring by confidence, showing the breakdown, surfacing the driver, and logging suppressions as first-class history.

A number without a confidence value is an opinion. A number with one is a measurement — worth acting on when it’s sharp, worth deferring to better data when it isn’t. The job of a readiness score isn’t to produce a number every day. It’s to tell you, every day, how much today’s number is worth.