Which Wearable Is Most Accurate? What 17 Validation Studies Actually Found

We reviewed 17 peer-reviewed studies comparing Oura, Apple Watch, Fitbit, Garmin, WHOOP, and Samsung across sleep, heart rate, HRV, SpO2, VO2 max, and more. Here are the specific numbers.

Mac DeCourcy · Updated April 3, 2026

Most “which wearable is best?” articles give you the same non-answer: it depends on what you value. Then they link affiliate products and move on.

We think you deserve the actual data. We dug into 17 peer-reviewed validation studies — three independent sleep studies using polysomnography, a 536-night HRV comparison against chest-strap ECG, multi-device SpO2 and VO2 max validations, and several manufacturer-funded studies where the funding itself is part of the story.

What we found is that no device wins everywhere, but the gaps between devices are larger than most people realize — and which device leads depends entirely on which metric you care about. Every study’s funding is disclosed below so you can judge the evidence yourself.


How to Read These Studies

Before the data, three things determine whether a study result means anything:

  1. Reference standard — Sleep studies should use polysomnography (PSG). Heart rate should compare against ECG (e.g., Polar H10). VO2 max needs indirect calorimetry. If a study uses self-report as the reference, treat results skeptically.

  2. Funding and affiliation — An Oura-funded study ranking Oura first for sleep doesn’t mean the data is wrong, but independent studies finding different rankings should carry more weight. We flag funding for every study below.

  3. Error metric — Correlation alone is misleading. A device can correlate well with PSG but still be biased by 30+ minutes. Look for mean absolute error (MAE), bias, concordance correlation coefficient (CCC), and Cohen’s kappa (κ).
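These metrics are straightforward to compute. A minimal sketch with made-up total-sleep-time readings (not data from any study) shows how bias, MAE, and Lin's CCC differ:

```python
# Hypothetical total-sleep-time readings (minutes): wearable vs. PSG reference.
device = [412, 388, 430, 401, 455, 390]
psg    = [400, 395, 410, 399, 440, 385]

n = len(device)
bias = sum(d - p for d, p in zip(device, psg)) / n       # signed mean error
mae  = sum(abs(d - p) for d, p in zip(device, psg)) / n  # mean absolute error

# Lin's concordance correlation coefficient (CCC): penalizes scatter
# AND systematic offset, unlike plain Pearson correlation.
mean_d, mean_p = sum(device) / n, sum(psg) / n
var_d = sum((x - mean_d) ** 2 for x in device) / n
var_p = sum((x - mean_p) ** 2 for x in psg) / n
cov   = sum((d - mean_d) * (p - mean_p) for d, p in zip(device, psg)) / n
ccc   = 2 * cov / (var_d + var_p + (mean_d - mean_p) ** 2)

print(f"bias={bias:+.1f} min, MAE={mae:.1f} min, CCC={ccc:.2f}")
```

Note how a device that always reads 30 minutes high can still have a high Pearson correlation, while its CCC drops because of the offset term in the denominator. That is why the HRV studies later in this article report CCC.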


1. Sleep Staging (4-Stage Classification)

Classifying sleep into light, deep, REM, and wake stages is where wearables struggle most — and where study funding has the most visible impact on results. Three major studies compared devices against polysomnography (PSG), and they produced meaningfully different rankings:

Robbins et al. (2024) — Oura-funded

36 participants, multiple nights, Brigham and Women’s Hospital. Funded by Oura Ring Inc. Lead author Dr. Rebecca Robbins is an Oura scientific advisor.

| Device | Cohen's κ | Notes |
|---|---|---|
| Oura Ring Gen 3 | 0.65 (Substantial) | No significant over/underestimation of any sleep stage |
| Apple Watch Series 8 | 0.60 (Moderate) | Overestimated light sleep by 45 min, underestimated deep sleep by 43 min |
| Fitbit Sense 2 | 0.55 (Moderate) | Moderate accuracy overall |

Park et al. (2023) — Independent

75 participants, 2 centers in Korea, 349,114 epochs analyzed. No industry funding disclosed. With 10× the epoch count and no manufacturer involvement, this study produced notably different rankings — Oura dropped from first to last:

| Device | Cohen's κ |
|---|---|
| Google Pixel Watch | 0.4–0.6 (Moderate) |
| Galaxy Watch 5 | 0.4–0.6 (Moderate) |
| Fitbit Sense 2 | 0.4–0.6 (Moderate) |
| Apple Watch 8 | 0.2–0.4 (Fair) |
| Oura Ring 3 | 0.2–0.4 (Fair) |

Schyvens et al. (2025) — Independent

62 adults, single-night PSG, University of Antwerp. Funded by VLAIO (Flanders Innovation & Entrepreneurship) — no device manufacturer funding.

| Device | Cohen's κ (4-stage) | TST Bias | Deep Sleep Correct | REM Correct | Wake Specificity |
|---|---|---|---|---|---|
| Apple Watch Series 8 | 0.53 (Moderate) | +19.6 min | 50.7% | 68.6% | 52.2% |
| Fitbit Sense | 0.42 (Moderate) | +6.3 min | 48.3% | 55.5% | 39.2% |
| Fitbit Charge 5 | 0.41 (Moderate) | +11.1 min | 43.3% | 47.5% | 42.7% |
| WHOOP 4.0 | 0.37 (Fair) | +24.5 min | 69.6% | 62.0% | 32.5% |
| Withings Scanwatch | 0.22 (Fair) | +39.9 min | 29.8% | 36.5% | 29.4% |
| Garmin Vivosmart 4 | 0.21 (Fair) | +38.4 min | 32.1% | 28.7% | 27.6% |

Clinically acceptable = <30 min TST bias. Only Apple Watch, Fitbit Sense, and Fitbit Charge 5 met this threshold. The Schyvens data reveals a universal pattern: every device defaults to “light sleep” when uncertain, systematically inflating light sleep totals and underestimating how often you actually wake up (by 12–48 minutes).

WHOOP Sleep Staging — University of Arizona (2020)

WHOOP achieved 89% agreement on 2-stage classification (sleep vs. wake) but only 64% on 4-stage classification (light/deep/REM/wake), with κ=0.47 against PSG. That’s moderate agreement — better than chance but far from clinical grade.
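Cohen's κ, the agreement statistic quoted throughout these sleep studies, measures how much device–PSG agreement exceeds what the two raters' label frequencies alone would produce by chance. A minimal sketch with invented 30-second epoch labels (not data from any study):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: chance-corrected agreement between two label sequences."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement, from each rater's marginal label frequencies
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

psg    = ["light", "light", "deep", "deep", "rem", "wake", "light", "rem"]
device = ["light", "light", "deep", "light", "rem", "light", "light", "rem"]
print(round(cohens_kappa(device, psg), 2))  # -> 0.63
```

Raw agreement in this toy example is 75%, yet κ is only 0.63, because both raters label most epochs "light" and some of that agreement would occur by chance. This is exactly why the studies above report κ instead of percent agreement.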

Deep Sleep Detection Sensitivity — Robbins et al. (2024), Oura-funded

| Device | Sensitivity | Bias |
|---|---|---|
| Oura Ring Gen 3 | 79.5% | No significant bias |
| Fitbit Sense 2 | 61.7% | −15 min (underestimates) |
| Apple Watch Series 8 | 50.5% | −43 min (underestimates) |

Wake Detection Sensitivity — Robbins et al. (2024) & Chinoy et al. (2022)

| Device | Sensitivity |
|---|---|
| Oura Ring Gen 3 | 68.6% |
| Fitbit Sense 2 | 67.7% |
| Apple Watch Series 8 | 52.4% |
| Garmin Vivosmart 4 | 27% |

Practical takeaway: Single-night stage breakdowns are unreliable from any device. Use multi-night averages for sleep duration trends. If you frequently wake during the night, know that your wearable is almost certainly undercounting those awakenings — regardless of brand.


2. Nocturnal Heart Rate Variability (HRV)

Dial et al. (2025) — Independent

13 participants, 536 nights, Ohio State University / Air Force Research Lab. No industry funding disclosed. Reference: Polar H10 chest strap.

Nocturnal HRV is where Oura’s ring form factor pays off — continuous finger-based PPG during sleep produces a cleaner signal than wrist-based sensors fighting motion noise. The study measured concordance correlation coefficients (CCC) for nocturnal RMSSD:

| Device | CCC | MAPE | Rating |
|---|---|---|---|
| Oura Ring Gen 4 | 0.99 | 5.96% ± 5.12% | Nearly Perfect |
| Oura Ring Gen 3 | 0.97 | 7.15% ± 5.48% | Substantial |
| WHOOP 4.0 | 0.94 | 8.17% ± 10.49% | Moderate |
| Garmin Fenix 6 | 0.87 | 10.52% ± 8.63% | Poor |
| Polar Grit X Pro | 0.82 | 16.32% ± 24.39% | Poor |
CCC scale: >0.99 = Nearly Perfect, 0.95–0.99 = Substantial, 0.90–0.95 = Moderate, <0.90 = Poor.

Important caveat: The Garmin Fenix 6 tested is 2+ generations old. Current Garmin devices (Fenix 7/8, Forerunner 265/965) may perform differently. The study authors acknowledged this limitation.

Practical takeaway: If nocturnal HRV is your primary recovery metric, Oura Gen 4 is the clear leader. But never compare HRV numbers across devices — the algorithms, sampling windows, and artifact filtering are all different. Pick one device and track your own trend over time.


3. Resting Heart Rate (RHR)

Dial et al. (2025) — Same study as above

| Device | CCC | MAPE | Rating |
|---|---|---|---|
| Oura Ring Gen 4 | 0.98 | 1.94% ± 2.51% | Nearly Perfect |
| Oura Ring Gen 3 | 0.97 | 1.67% ± 1.54% | Substantial |
| WHOOP 4.0 | 0.91 | 3.00% ± 2.15% | Moderate |
| Polar Grit X Pro | 0.86 | 2.71% ± 2.75% | Poor |

Note: Garmin Fenix 6 was excluded from RHR analysis due to timestamp reporting issues that prevented alignment with the Polar H10 reference data.

Resting HR is the one metric where you can mostly trust any device. Even the weakest performer (Polar, CCC 0.86) stays under 3% average error. This is the easiest measurement for an optical sensor — at rest, blood flow is steady, motion artifact is near zero, and the signal-to-noise ratio is high.

Practical takeaway: If your resting HR suddenly jumps 5–10 bpm, it likely reflects a real physiological change (illness, stress, overtraining) regardless of which device you’re wearing. This is the most actionable metric across all wearables.
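That takeaway reduces to a simple rule: compare today's resting HR against your own trailing average rather than population norms. The function name, 5 bpm threshold, and data below are illustrative only, not from any study or vendor API:

```python
from statistics import mean

def rhr_alert(history, today, window=7, threshold_bpm=5):
    """Flag today's resting HR if it exceeds the trailing
    `window`-day average by at least `threshold_bpm`."""
    baseline = mean(history[-window:])
    return today - baseline >= threshold_bpm

week = [52, 53, 51, 52, 54, 52, 53]   # made-up resting HR values, bpm
print(rhr_alert(week, today=60))      # ~7.6 bpm above baseline -> True
print(rhr_alert(week, today=54))      # ordinary day-to-day noise -> False
```

Because every device in the table above stays under 3% average error at rest, a jump this size is far larger than measurement noise on any of them.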


4. Active Heart Rate

WellnessPulse Meta-Analysis (2025)

Active heart rate accuracy (percentage of readings within acceptable error):

| Device | Accuracy |
|---|---|
| Apple Watch | 86.31% |
| Fitbit | 73.56% |
| Garmin | 67.73% |
| TomTom | 67.63% |

Heart rate correlation vs. ECG during activity (WellnessPulse / PubMed Central aggregate):

| Device | Correlation (r) |
|---|---|
| Polar Chest Strap | 0.99 |
| Apple Watch | 0.80 |
| Garmin | 0.52 |

The gap between resting and active HR accuracy is striking. Apple Watch drops from near-perfect resting agreement to 86% accuracy and r=0.80 during exercise — and it’s still the best wrist device. Garmin’s r=0.52 during activity means its readings are barely correlated with actual heart rate — functionally useless for pacing decisions.

The physics explains this: wrist-based optical sensors measure blood volume changes through skin. During exercise, motion artifact, sweat, and reduced peripheral blood flow all degrade the signal. Activities with grip pressure (cycling, rowing, lifting) or rapid arm movement (boxing, CrossFit) are the worst cases.

Practical takeaway: For zone-based cardio training where HR accuracy matters, a chest strap (r=0.99) is in a different league from any wrist sensor. For steady-state running or walking, Apple Watch is adequate.


5. Blood Oxygen (SpO2)

Various validation studies (PLOS, Nature, etc.)

| Device | MAE | MDE | RMSE |
|---|---|---|---|
| Apple Watch Series 7 | 2.2% | −0.4% | 2.9% |
| Garmin Fenix 6 Pro | ~4.5% | | |
| Withings ScanWatch | ~4.8% | | |
| Garmin Venu 2s | 5.8% | 5.5% | 6.7% |

| Device | Within Range | Underestimate | Missing Data |
|---|---|---|---|
| Apple Watch Series 7 | 58.3% | 24.3% | 11% |
| Garmin Fenix 6 Pro | ~44% | ~28% | 28% |
| Withings ScanWatch | ~38% | ~31% | 31% |
| Garmin Venu 2s | 18.5% | 67.4% | 14% |

These numbers are sobering. The best consumer device (Apple Watch) is only “within range” 58% of the time — meaning it’s wrong in some direction for 4 out of 10 readings. The Garmin Venu 2s underestimates SpO2 in two-thirds of readings and misses data entirely 14% of the time. None of these are FDA-cleared for SpO2 — Apple Watch included.

Practical takeaway: Consumer SpO2 is useful for detecting trends (altitude adaptation, possible sleep apnea patterns over weeks) but should never inform a medical decision. If you see consistently low readings, get a medical-grade pulse oximeter before worrying.


6. Step Count

WellnessPulse Meta-Analysis (2025)

| Device | Accuracy |
|---|---|
| Garmin | 82.58% |
| Apple Watch | 81.07% |
| Fitbit | 77.29% |
| Jawbone | 57.91% |
| Polar | 53.21% |
| Oura Ring | Poor (50.3% error real-world, 4.8% controlled) |

Additional MAPE data:

| Device | MAPE |
|---|---|
| Garmin Vivoactive 4 | <2% |
| Fitbit Sense | ~8% |

Step counting is the most commoditized metric — Garmin and Apple Watch are within 1.5% of each other, and even Fitbit’s 77% is serviceable. The real outlier is Oura, which was never designed for step detection (a finger doesn’t swing like a wrist during walking). Edge cases that degrade all devices: slow gait, pushing a stroller or cart, walking with a cane, and arm-intensive activities that get misclassified as steps.

Practical takeaway: For step-based activity goals, any major wrist device works. Don’t obsess over daily counts — look at 7-day rolling averages to smooth out noise.
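The smoothing suggested above is just a trailing mean. A minimal sketch with invented step counts:

```python
def rolling_avg(values, window=7):
    """Trailing `window`-day averages; the first value appears once
    a full window of data exists."""
    return [sum(values[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(values))]

steps = [9800, 4200, 12100, 7600, 8900, 3100, 10400, 9900, 5200, 11700]
smoothed = rolling_avg(steps)
print([round(s) for s in smoothed])
```

Daily values here swing by thousands of steps, while the rolling averages vary by only a couple hundred. The trend line, not the daily number, is the signal worth watching.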


7. Calories / Energy Expenditure

WellnessPulse Meta-Analysis (2025)

| Device | Accuracy |
|---|---|
| Apple Watch | 71.02% |
| Fitbit | 65.57% |
| Polar | ~50–65% |
| Garmin | 48.05% |

Oura Ring reports ~87% accuracy (13% average error), but this figure reflects resting/basal metabolic estimates — a fundamentally easier problem than tracking active calorie burn. It’s not directly comparable to the active-exercise accuracy figures above.

This is the weakest metric category overall. Apple Watch leads at 71%, which still means nearly a third of calorie readings are off by a meaningful amount. Garmin’s 48% during activity is essentially no better than guessing. All devices use HR + accelerometer data fed into proprietary algorithms, and accuracy drops further during high-intensity or multi-modal exercise.

Practical takeaway: No consumer wearable is a reliable calorie counter. For body composition goals, track nutrition intake and scale trends — don’t trust the burn number on your wrist.


8. VO2 Max Estimation

Caserman et al. (2024), Lambe et al. (2025), Garmin validations

| Device | MAPE | MAE | Bias / Notes |
|---|---|---|---|
| Garmin Forerunner 245 | 5.7% | | Acceptable for runners |
| Garmin Fenix 6 | 7.05% | | CCC=0.73 for 30s averages |
| Apple Watch Series 7 | 15.79% | 6.07 ml/kg/min | Underestimates |
| Apple Watch (2025 study) | 13.31% | 6.92 ml/kg/min | Mixed |

Garmin’s advantage here is substantial — roughly half the error rate of Apple Watch (MAPE 5.7–7% vs. 13–16%). Apple Watch’s MAE of 6–7 ml/kg/min is significant for a metric that typically ranges 30–60 ml/kg/min — that’s a 10–20% relative error. Both devices share a systematic bias: they pull everyone toward the population mean, overestimating sedentary users and underestimating athletes. If your true VO2 max is 55+, expect your watch to lowball it.

Practical takeaway: For VO2 max trending, Garmin is the clear winner. But even Garmin’s number is an estimate — useful for tracking your own trajectory over months, not for comparing against someone else’s device or a lab result.


9. Skin Temperature

Oura Internal Validation (2024)

16 participants, 1 week, 93,571 data points. This is Oura’s own study, not independently peer-reviewed.

| Device | Lab Accuracy | Real-World Accuracy | Precision |
|---|---|---|---|
| Oura Ring | r² > 0.99 | r² > 0.92 | ±0.13°C (0.234°F) per minute |

Independent menstrual cycle tracking studies (Maijala et al., 2019) have validated the utility of nocturnal finger skin temperature for cycle phase detection.

Apple Watch, Garmin, WHOOP, and Samsung all track skin temperature, but there is little independent head-to-head validation data comparing their accuracy, which is why this metric is excluded from the master summary.

Practical takeaway: Skin temperature trends are useful for cycle tracking and illness detection. Oura has the most published data here, but independent comparative studies are still lacking.


10. Respiratory Rate

Respiratory rate is the least validated metric across all consumer wearables. Most manufacturers claim to track it, but independent comparative studies are essentially nonexistent.

Samsung has published validation data (Park et al., 2023 — Samsung-funded), but cross-device comparisons don’t exist in the literature.

Practical takeaway: Treat respiratory rate as experimental. If a sudden change correlates with other signals (elevated RHR, poor sleep), it may be worth noting, but don’t rely on it in isolation.


11. FDA-Cleared Features

| Feature | Device | Status |
|---|---|---|
| ECG / Atrial Fibrillation Detection | Apple Watch (Series 4+) | FDA Cleared |
| ECG / Atrial Fibrillation Detection | Samsung Galaxy Watch (4+) | FDA Cleared |
| Sleep Apnea Notification | Apple Watch (Series 9+, Ultra 2) | FDA Authorized |
| Sleep Apnea Detection | Samsung Galaxy Watch | FDA De Novo Authorized (Feb 2024) |
| Blood Oxygen (SpO2) | Apple Watch | Wellness feature (not FDA cleared) |
| Irregular Rhythm Notification | Fitbit | FDA Cleared |

WHOOP and Garmin have no FDA-cleared features. FDA clearance means the feature has passed validation for a specific clinical use case. “Wellness” features (most HRV, sleep staging, stress scores) have no regulatory oversight.


Important Caveats

These aren’t footnotes — they materially affect how you should interpret everything above:

  1. Study funding matters. The primary sleep study ranking Oura highest (Robbins et al.) was Oura-funded. Two independent studies (Park et al., Schyvens et al.) found different rankings. Weight independent findings more heavily when they conflict.

  2. Device generations matter. The Garmin Fenix 6 and Vivosmart 4 tested in several studies are 2+ generations behind current models. Results may not apply to the Fenix 8 or Forerunner 965.

  3. Small sample sizes. The HRV/RHR study (Dial et al.) had only 13 participants, though 536 nights of data partially compensates. The Antwerp study had 62 participants but only 1 night each.

  4. PSG isn’t perfect either. The “gold standard” polysomnography has inter-rater reliability of κ≈0.75, meaning even human sleep experts disagree ~25% of the time on stage classification.

  5. Skin tone and body composition bias. PPG (optical heart rate) accuracy is affected by skin pigmentation, tattoos, BMI, and wear fit. Most validation studies have predominantly white participants — a critical research gap.

  6. Individual variation is real. Accuracy can differ meaningfully from person to person based on wrist anatomy, skin tone, tattoos, body composition, and how tightly the device is worn. Population-level accuracy figures don’t guarantee your personal experience.

  7. Calorie tracking is weak across all devices. Even the best performer (Apple Watch, 71%) is wrong nearly a third of the time. No consumer wearable should be used as a precise calorie counter.

  8. All wearables default to light sleep when uncertain. Every consumer device tested shows the same conservative algorithmic bias: when in doubt, label it light sleep. This inflates light sleep percentages across the board.

  9. Algorithms update silently. A firmware update can change how your device calculates HRV, sleep stages, or recovery scores. Validation studies test a snapshot in time — your device’s current firmware may produce different results.


Cross-Metric Patterns: What The Data Actually Reveals

Three patterns emerge when you look across all 17 studies together — patterns you won’t see if you only read one study at a time.

Pattern 1: Recovery metrics vs. activity metrics are dominated by different devices

Oura consistently leads metrics measured at rest — nocturnal HRV (CCC 0.99), resting heart rate (CCC 0.98), and even resting calorie estimation (~87%). Apple Watch consistently leads metrics measured during activity — active HR (86.3%), SpO2 (MAE 2.2%), and sleep staging in independent studies (κ=0.53). Garmin leads fitness performance metrics — step counting (82.6%) and VO2 max (MAPE 5.7–7%).

This isn’t coincidental. Oura is a ring — it has excellent skin contact and minimal motion artifact during sleep, but it can’t track wrist movement well (poor step counting) and has no GPS. Apple Watch is a full smartwatch with GPS, accelerometer, and gyroscope — better suited for daytime activity tracking. Garmin’s running-focused algorithms have years of sport-specific tuning.

Pattern 2: Study funding consistently shifts rankings

| Metric | Oura-funded result | Independent result |
|---|---|---|
| Sleep staging | Oura #1 (κ=0.65) | Oura #5 (κ=0.2–0.4) |
| Deep sleep | Oura #1 (79.5%) | WHOOP #1 (69.6%) |
| Wake detection | Oura #1 (68.6%) | Apple Watch #1 (52.2%) |

This doesn’t prove the Oura-funded studies are wrong — but it does mean you should weight independent findings more heavily when the two conflict.

Pattern 3: Every device has the same failure mode for sleep

Across all three sleep studies and all six devices tested, every single one defaults to labeling uncertain epochs as “light sleep.” This inflates light sleep totals and underestimates wake time by 12–48 minutes. It’s a conservative algorithmic choice — manufacturers would rather you think you slept lightly than tell you that you were awake and have their “sleep score” look worse.


Which Device For Your Goal?

Instead of “which is best overall,” the research points to specific devices for specific goals:

If you care most about recovery and readiness: Oura Gen 4 — best-in-class nocturnal HRV (CCC 0.99, MAPE 5.96%), best resting HR (CCC 0.98, MAPE 1.94%). Recovery signals are measured at rest, where Oura’s ring form factor excels.

If you care most about workout accuracy: Apple Watch — leads active HR (86.3% accuracy, r=0.80 vs ECG), best SpO2 (MAE 2.2%), strong sleep staging in independent studies (κ=0.53). For intervals or high-intensity work, pair any wrist device with a Polar chest strap (r=0.99).

If you care most about running/cardio performance: Garmin — leads VO2 max estimation (MAPE 5.7–7% vs Apple Watch’s 13–16%), leads step counting (82.6%), strong activity-specific algorithms. Weak on recovery metrics (HRV CCC 0.87, excluded from RHR analysis).

If you want clinical-grade cardiac screening: Apple Watch or Samsung Galaxy Watch — only devices with FDA-cleared ECG and atrial fibrillation detection.

If you want one device that does everything adequately: Apple Watch — never the worst at anything, top 2 in most activity metrics, only device with FDA-cleared cardiac features. Its main weakness is HRV/recovery tracking, where Oura leads significantly.

The real insight isn’t “buy Device X.” It’s that no single wearable covers every blind spot. Oura can’t tell you how hard your workout was. Apple Watch can’t match Oura’s recovery signal fidelity. Garmin’s VO2 max estimate won’t help you understand why your sleep tanked.

That’s the case for combining sources — not to collect more data for its own sake, but to give yourself enough context to actually interpret what’s happening. When your recovery score drops, you want to know whether it’s the bad sleep, the hard workout, the late meal, or the bedroom temperature. No single wrist (or finger) can see all of that.


Sources

  1. Robbins R, et al. (2024). “Accuracy of Three Commercial Wearable Devices for Sleep Tracking in Healthy Adults.” Sensors, 24(20), 6532. DOI: 10.3390/s24206532. Funded by Oura Ring Inc.

  2. Dial MB, et al. (2025). “Validation of nocturnal resting heart rate and heart rate variability in consumer wearables.” Physiological Reports, 13(16), e70527. DOI: 10.14814/phy2.70527. Independent (Ohio State / Air Force Research Lab)

  3. Park et al. (2023). “Accuracy of 11 Wearable, Nearable, and Airable Consumer Sleep Trackers: Prospective Multicenter Validation Study.” JMIR mHealth and uHealth, 11, e50983. DOI: 10.2196/50983. Independent (Korean multicenter)

  4. Park et al. (2023). “Validating a Consumer Smartwatch for Nocturnal Respiratory Rate Measurements in Sleep Monitoring.” Sensors, 23(18), 7867. DOI: 10.3390/s23187867. Samsung-affiliated, Samsung-funded

  5. Khodr R, et al. (2024). “Accuracy, Utility and Applicability of the WHOOP Wearable Monitoring Device in Health, Wellness and Performance — A Systematic Review.” medRxiv. DOI: 10.1101/2024.01.04.24300784

  6. Oura Internal Validation (2024). Temperature sensor validation study. 16 participants, 93,571 data points. Published on Oura blog — Oura internal study

  7. Maijala et al. (2019). “Nocturnal finger skin temperature in menstrual cycle tracking.” BMC Women’s Health, 19, 150. DOI: 10.1186/s12905-019-0844-9

  8. Lanfranchi et al. (2024). Samsung Galaxy Watch SpO2 validation. Journal of Clinical Sleep Medicine, 20(9), 1479–1488. DOI: 10.5664/jcsm.11178. Samsung-affiliated

  9. WellnessPulse Meta-Analysis (2025). Accuracy of Fitness Trackers — Aggregate data

  10. AIM7. Smartwatch/Wearable Technology Accuracy — Aggregate validation data

  11. Christakis et al. (2025). “A guide to consumer-grade wearables in cardiovascular clinical care.” npj Cardiovascular Health, 2, 82. DOI: 10.1038/s44325-025-00082-6

  12. PMC/JAMA (2025). “Selecting Wearable Devices to Measure Cardiovascular Functions in Community-Dwelling Adults.” DOI: 10.1016/j.jamda.2025.105529

  13. Schyvens AM, et al. (2025). “Performance of six consumer sleep trackers in comparison with polysomnography in healthy adults.” Sleep Advances, 6(1), zpaf016. DOI: 10.1093/sleepadvances/zpaf016. Independent (VLAIO-funded, University of Antwerp)

  14. Caserman P, et al. (2024). “Validity of Apple Watch Series 7 VO2 Max Estimation.” JMIR Biomedical Engineering, 9, e54023.

  15. Lambe RF, et al. (2025). “Validation of Apple Watch VO2 max estimates.” PLOS One, 20(2), e0318498. DOI: 10.1371/journal.pone.0318498

  16. Miller DJ, et al. (2022). “A Validation of Six Wearable Devices for Estimating Sleep, Heart Rate and Heart Rate Variability in Healthy Adults.” Sensors, 22(16), 6317. DOI: 10.3390/s22166317

  17. University of Arizona (2020). WHOOP sleep staging validation vs polysomnography. 89% 2-stage agreement, 64% 4-stage, κ=0.47.


Omnio unifies data from Oura, Apple Watch, Garmin, WHOOP, and more — so you can see what actually matters across all your sources. Join the pre-beta at getomn.io.