Can Your Health AI Prove It Isn't Lying to You?

We asked Oura, WHOOP, and Omnio the same research question. Two gave vague platitudes with zero citations. One gave verifiable evidence. Here's why the difference is architectural.

Mac DeCourcy · Updated April 3, 2026

Ask your health AI what the research says about HRV and recovery. You’ll get an answer — fluent, confident, authoritative. But could it prove any of it? Could you trace a single claim back to a real paper?

We ran this test across three platforms. The results reveal two distinct ways health AI fails — and why both erode your trust.


The Test

We asked Oura Advisor, WHOOP Coach, and Omnio AI the same question:

“What does the research say about ideal HRV ranges for recovery?”

This isn’t a trick question. It’s the kind of thing anyone with a wearable might reasonably ask. The answers were revealing — not because any platform got it catastrophically wrong, but because of what each answer couldn’t do.


Oura Advisor

[Screenshot: Oura Advisor's response to the test question]

“Research generally shows that there isn’t a single ideal HRV number that applies to everyone. What matters most is how your HRV behaves relative to your own usual range…”

Oura’s answer is careful and conversational. It mentions “recovery studies” in passing but doesn’t cite a single one. No author names. No PMIDs. No evidence tiers. It’s a reasonable summary of conventional wisdom — the kind of thing you could get from a wellness blog — wrapped in a tone that implies clinical authority.

The response ends by turning the question back on you: “If you look at your own nights recently, you might notice whether your HRV feels settled or fluctuating.” You asked the AI what the research says. It asked you to look at your own data and figure it out.

What’s missing: Any traceable research citation. Any specific finding. Any way to verify the claims.


WHOOP Coach

[Screenshot: WHOOP Coach's response to the test question]

“Research shows that ‘ideal’ HRV ranges are highly individual and can vary by age, fitness, genetics, and even time of day… For recovery on WHOOP, what matters most is how your nightly HRV compares to your own 30-day baseline — not a generic standard.”

WHOOP’s answer hits the same notes as Oura’s: HRV is personal, focus on your baseline, higher-than-baseline means recovered. Also zero citations. Also no research referenced by name. The tone is confident and coaching-oriented, but there’s nothing a user could verify against PubMed.

What’s missing: Same as Oura — no citations, no specific findings, no evidence hierarchy. Generic guidance dressed as research-informed advice.


Omnio AI

[Screenshot: Omnio AI's response to the test question]

Omnio’s response opens with the same grounding — individual baselines matter — but then does something neither competitor can: it cites actual research.

The response includes specific findings from named studies, labeled by evidence tier (meta-analysis, systematic review, RCT), with PMIDs that resolve to real papers on PubMed. It distinguishes between what a meta-analysis found versus what an observational study suggests. And it integrates the user’s actual HRV data with the population-level findings — showing where their personal trends align with or diverge from the research.

Every citation is verifiable. Every PMID resolves. The evidence hierarchy is explicit.

The difference isn’t that Omnio’s AI is smarter. It’s that the system behind it is designed so the AI can’t cite what it hasn’t verified.


The Two Ways Health AI Fails

These screenshots illustrate something important: health AI doesn’t just fail by making things up. It fails in two distinct ways, and most products fall into one camp or the other.

Failure Mode 1: Fabrication

General-purpose LLMs — the kind powering most chatbots — hallucinate citations. Ask ChatGPT or Gemini about HRV and sleep, and you’ll get confident references to “Chen et al. (2023)” with fabricated PMIDs and invented statistics. The studies sound real. The numbers are plausible. None of it checks out.

This is the more dangerous failure mode. A fabricated citation creates false confidence. Users change behavior based on nonexistent findings, share fake studies with others, and build mental models of their health grounded in fiction. If they bring AI-generated “research” to a healthcare provider, it erodes the provider’s trust in their engagement.

Failure Mode 2: Vacuousness

Oura and WHOOP have clearly learned from the fabrication problem — their AIs don’t hallucinate citations because they don’t cite anything at all. Instead, they produce vague, unfalsifiable guidance that sounds authoritative but contains no verifiable claims.

This is safer than fabrication, but it’s not honest either. When a user asks “what does the research say,” they’re asking for research. Responding with generic coaching advice — “focus on your baseline,” “HRV is personal” — without citing a single study is a non-answer wearing the costume of an answer. The clinical tone and coaching framing create an impression of evidence-backed authority that the content doesn’t support.

The Common Thread

Both failure modes share the same root cause: the AI is more confident than its evidence base. Fabrication fills the gap with invented citations. Vacuousness fills it with authoritative-sounding platitudes. Neither gives the user what they actually need — verifiable evidence they can trace back to source.

Most health AI products treat this as a prompt engineering problem: “Please don’t hallucinate.” That doesn’t work. The model doesn’t know what it doesn’t know. You have to build the constraint into the system, not the instruction.


What We Built: Evidence-First Intelligence

We took the position that our AI should never be more confident than its evidence base. If it can’t find a study, it says so. If the best available evidence is observational rather than a randomized controlled trial, it tells you that too. And if you ask what the research says, you get actual research — not coaching platitudes.

Here’s what that required:

A Curated Medical Literature Corpus

We don’t let the AI search the open internet for studies. Instead, we maintain a curated corpus of peer-reviewed medical literature spanning 15 health domains — sleep, HRV, cardiovascular health, nutrition, body composition, VO2 max, glycemic control, metabolic health, stress, recovery, sports science, bloodwork, wearable validation, blood oxygen, and environmental factors.

Papers are sourced from PubMed and go through a multi-stage quality gate before they enter the system. Not everything that gets published is worth citing. We filter for:

  • Relevance — Is the paper actually about one of our domains, or does it just mention it in passing?
  • Evidence tier — Meta-analyses and systematic reviews surface first. RCTs rank above cohort studies. Observational research is labeled as such. We don’t hide the hierarchy.
  • Actionability — Does this research translate to something a person can actually do? Pure bench science, animal models, and narrow clinical populations (e.g., post-surgical pediatric patients) are excluded.
  • Retraction status — Papers that have been retracted are automatically flagged and removed. You’d be surprised how many AI systems still cite retracted studies.

The result is a knowledge base of thousands of quality-gated papers, organized by domain and evidence tier, kept current with periodic re-ingestion.
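
Here's a minimal sketch of what a gate like that could look like in code. The record fields, tier labels, and helper function are illustrative assumptions for this post, not our production pipeline.

```python
# A minimal sketch of a multi-stage quality gate. Field names and tier labels
# are illustrative assumptions, not the actual ingestion pipeline.
from dataclasses import dataclass

@dataclass(frozen=True)
class Paper:
    pmid: str
    title: str
    domains: frozenset[str]   # health domains the paper addresses, e.g. {"hrv", "recovery"}
    evidence_tier: str        # "meta_analysis", "systematic_review", "rct", "cohort", "observational"
    is_retracted: bool
    general_population: bool  # False for animal models or narrow clinical populations

RECOGNIZED_TIERS = {"meta_analysis", "systematic_review", "rct", "cohort", "observational"}

def passes_quality_gate(paper: Paper, supported_domains: set[str]) -> bool:
    """Return True only if the paper clears every stage described above."""
    if paper.is_retracted:                           # retraction status
        return False
    if not (paper.domains & supported_domains):      # relevance to a covered domain
        return False
    if paper.evidence_tier not in RECOGNIZED_TIERS:  # known evidence tier
        return False
    if not paper.general_population:                 # actionability for a general audience
        return False
    return True
```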

Citation Verification at the System Level

This is the most important piece. When our AI generates a response, it can only cite papers that were actually retrieved from the corpus for that specific query. A dedicated verification layer inspects every citation the model produces — every PMID, every URL, every author-year reference — and strips anything that doesn’t match a retrieved paper.

If the LLM hallucinates a citation, it gets caught. Not sometimes. Every time. It’s not a heuristic or a best-effort filter — it’s a deterministic check against the retrieved evidence set.

This means the worst case for an Omnio response is that a citation gets removed, leaving a less-detailed but still accurate answer. The worst case for a system without this layer is that a fabricated citation reaches the user and gets treated as real.
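
To make that concrete, here's a minimal sketch of what a deterministic check like this could look like, assuming PMIDs are the citation key. The regex, function name, and removal marker are illustrative, not our production implementation.

```python
# A minimal sketch of a deterministic citation check against the retrieved
# evidence set. The PMID format and removal marker are illustrative.
import re

PMID_PATTERN = re.compile(r"PMID:\s*(\d{1,8})")

def strip_unverified_citations(draft: str, retrieved_pmids: set[str]) -> str:
    """Keep a PMID only if it appeared in the evidence retrieved for this
    specific query; replace anything else with a removal marker."""
    def check(match: re.Match) -> str:
        return match.group(0) if match.group(1) in retrieved_pmids else "[citation removed]"
    return PMID_PATTERN.sub(check, draft)

# Example: the fabricated PMID is stripped, the retrieved one survives.
draft = "A meta-analysis (PMID: 31234567) and an RCT (PMID: 99999999) both found..."
print(strip_unverified_citations(draft, {"31234567"}))
# A meta-analysis (PMID: 31234567) and an RCT ([citation removed]) both found...
```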

Evidence Tier Transparency

We don’t just cite studies — we tell you what kind of study it is:

  • Meta-Analysis: Aggregated results from multiple studies. Strongest evidence tier.
  • Systematic Review: Structured review of all available evidence on a question.
  • RCT: Randomized controlled trial. Gold standard for causal claims.
  • Cohort Study: Observational, tracking a group over time. Good for associations, weaker for causation.
  • Observational: Cross-sectional or case-control. Useful for hypothesis generation, not strong conclusions.

When the AI says “a meta-analysis of 23 studies found…”, you can verify that it’s actually a meta-analysis. When it cites an observational study, it’s labeled accordingly. The user gets to calibrate their confidence to the evidence quality, not just to the AI’s tone.
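
As an illustration, an ordering like the one above can be encoded directly so that retrieval ranks papers by study design rather than by fluency. The mapping and function below are a sketch, not our production ranking code.

```python
# A minimal sketch of evidence-tier ranking: stronger study designs surface
# first, unknown tiers sink to the bottom. Values are illustrative.
TIER_RANK = {
    "meta_analysis": 0,
    "systematic_review": 1,
    "rct": 2,
    "cohort": 3,
    "observational": 4,
}

def sort_by_evidence_strength(papers: list) -> list:
    """Order retrieved papers strongest-first. Papers are any records with
    an `evidence_tier` attribute (e.g. the Paper sketch above)."""
    return sorted(papers, key=lambda p: TIER_RANK.get(p.evidence_tier, len(TIER_RANK)))
```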

Graceful Degradation, Not Graceful Fabrication

When our retrieval system can’t find strong evidence for a question, the AI doesn’t fill the gap with plausible-sounding nonsense. It tells you: “I don’t have strong peer-reviewed evidence for this specific question.”

This is a product decision. A fluent, confident answer that’s wrong is worse than a shorter, honest answer that acknowledges its limits. Most health AI products optimize for engagement — longer, more detailed, more authoritative-sounding responses. We optimize for accuracy, even when it means saying less.
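
In sketch form, this is a guard clause in front of generation: if retrieval comes back empty, return the honest fallback rather than asking the model to improvise. The wording and structure below are illustrative, with the model call passed in as a stand-in.

```python
# A minimal sketch of "degrade, don't fabricate": no retrieved evidence means
# no research-backed answer is attempted. Message wording is illustrative.
NO_EVIDENCE_MESSAGE = (
    "I don't have strong peer-reviewed evidence for this specific question."
)

def answer(question: str, retrieved_papers: list, generate) -> str:
    """Generate a cited answer only when retrieval found evidence;
    otherwise return the honest fallback instead of a fluent guess."""
    if not retrieved_papers:
        return NO_EVIDENCE_MESSAGE
    return generate(question, retrieved_papers)  # LLM drafts against real evidence
```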


Why “Don’t Hallucinate” Isn’t a Solution

It’s tempting to think you can solve hallucination with better prompts. “Only cite real studies.” “Don’t make up references.” Every major LLM provider recommends this approach.

Some products have taken this to its logical conclusion — Oura and WHOOP appear to have instructed their models to avoid citations entirely. This eliminates fabrication, but it also eliminates evidence. You’ve traded one failure mode for another.

Prompt-based constraints don’t work for three reasons:

  1. The model doesn’t have a ground truth to check against. When it generates “Smith et al. (2023),” it isn’t consulting a database; it’s predicting the next plausible token. It can’t distinguish between a real memory and a statistically plausible fabrication.

  2. Instruction-following degrades under load. As conversations get longer and context fills up, models are more likely to take shortcuts. The “don’t hallucinate” instruction gets outweighed by the pattern-matching pressure to produce citation-shaped text.

  3. Suppressing citations doesn’t build trust — it just hides the problem. If your solution to “the AI might cite fake studies” is “don’t cite any studies,” you haven’t solved the trust problem. You’ve just made it invisible. The user still has no way to evaluate the quality of the advice they’re receiving.

The only reliable approach is architectural: give the AI access to real evidence, constrain it to only cite what it retrieved, verify what it produces, and strip what it can’t prove. Prompt engineering is a speed bump. System design is a wall.


Trust as the Product

Health platforms have historically competed on features — more metrics, more dashboards, more integrations. We think the next differentiator is trust.

Not trust in the brand sense — trust in the mechanical sense. Can you trace a piece of advice back to its source? Can you verify the evidence tier? Can you confirm the citation is real? When the AI says “based on your data,” can you see the actual data it used?

The current landscape offers you a choice between AI that invents evidence and AI that avoids evidence entirely. We don’t think either is acceptable. If you ask what the research says, you should get research — real papers, real PMIDs, real evidence tiers — not fabricated citations and not vague coaching advice.

We’re building for a future where health AI is held to the same evidentiary standard as any other health claim. The platforms that can demonstrate that standard — not just promise it — will be the ones that earn long-term user trust.


Omnio is building health intelligence you can actually verify. Join the waitlist at getomn.io.