How We Avoid Spurious Correlations in Health Data

When you test enough metric pairs, something will appear correlated by chance. Here's how Omnio's correlation engine uses detrending, dual methods, p-values, and FDR correction to separate real signal from noise.

Mac DeCourcy · Updated April 3, 2026

Health tracking generates a lot of numbers. Steps, sleep stages, HRV, room temperature, training volume, stress, nutrition — dozens of metrics, every day, from multiple devices. The temptation is to correlate everything with everything and see what sticks.

The problem: when you test enough pairs, something will appear correlated by pure chance. And health data has a nasty habit of making those spurious results look plausible. This post explains how Omnio’s correlation engine separates real signal from statistical noise.

Confounding by Shared Trend

Suppose your step count has been gradually increasing over the past three months. Your sleep score has been improving too — maybe because you started sleeping earlier, or the weather got better, or you reduced caffeine. If you compute a simple correlation between steps and sleep score, you’ll get a strong positive result. But the two trends might be completely independent.

This is confounding by shared trend. Both metrics are going up over time, so they look correlated even though one doesn’t cause — or even predict — the other. Health data is especially prone to this because lifestyle changes, seasonal patterns, and habits affect multiple metrics simultaneously.

A naive correlation engine doesn’t know the difference between “these metrics genuinely co-vary day to day” and “these metrics both happen to be trending upward.” We needed to fix that.

Our Approach: Four Layers of Statistical Rigor

1. Minimum Sample Size (n >= 14)

The single most impactful decision. With only 5 data points, random noise can easily produce a correlation coefficient of r = 0.9 — strong enough to look convincing but statistically meaningless.

We require at least 14 paired observations before computing any correlation. At this sample size, a correlation needs |r| > 0.53 to reach statistical significance (p < 0.05). This eliminates the most common source of spurious results: insufficient data.

If you’ve only been tracking a metric for a week, we won’t show a correlation for it. That’s by design — we’d rather show nothing than show noise.
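The |r| > 0.53 threshold falls out of the t-distribution for a correlation coefficient with n − 2 degrees of freedom. A short sketch of that calculation (using scipy; `critical_r` is an illustrative helper, not Omnio's code):

```python
import math
from scipy import stats

def critical_r(n: int, alpha: float = 0.05) -> float:
    """Smallest |r| that is two-tailed significant at `alpha` with n paired points."""
    df = n - 2
    t_crit = stats.t.ppf(1 - alpha / 2, df)       # critical t for a two-tailed test
    return t_crit / math.sqrt(t_crit**2 + df)     # invert t = r*sqrt(df)/sqrt(1-r^2)

print(f"{critical_r(5):.2f}")    # ~0.88: tiny samples demand a huge coefficient
print(f"{critical_r(14):.2f}")   # ~0.53: the threshold quoted above
```

The same function also shows why 5 data points are hopeless: even r = 0.85 would not be significant there.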

2. Detrending (Rolling Mean Subtraction)

Before computing any correlation, we remove slow drift from both time series using a 7-day centered rolling mean.

For each day’s value, we subtract the average of the surrounding week. What’s left is the day-to-day variation — the signal that actually tells you whether today’s training volume affected tonight’s sleep, not whether both metrics have been climbing over the past quarter.

This is the same technique used in time series analysis to isolate short-term co-variation from long-term trends. It’s simple, interpretable, and effective at reducing confounding from seasonal patterns and gradual lifestyle changes.
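A sketch of the idea with pandas, reusing the synthetic trending-series example (this is an illustration of rolling-mean detrending, not Omnio's production code):

```python
import math
import pandas as pd

days = range(90)
steps = pd.Series([100 * d + 500 * math.sin(d) for d in days])
sleep = pd.Series([0.05 * d + 0.3 * math.cos(1.7 * d) for d in days])

def detrend(s: pd.Series, window: int = 7) -> pd.Series:
    # Subtract a centered rolling mean; edge days without a full
    # window become NaN and are excluded from the correlation.
    return s - s.rolling(window, center=True).mean()

raw_r = steps.corr(sleep)                       # inflated by the shared trend
det_r = detrend(steps).corr(detrend(sleep))     # day-to-day co-variation only
print(f"raw r = {raw_r:.2f}, detrended r = {det_r:.2f}")
```

The centered window removes the linear drift exactly (the average of days d−3 through d+3 of a straight line is the value at day d), so what survives is only the short-term variation — and for these two independent signals, the detrended correlation collapses toward zero.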

3. Dual Correlation Methods (Pearson + Spearman)

We compute two correlation coefficients for every metric pair:

Pearson correlation measures linear relationships. If Steps goes up by 1,000, does Deep Sleep consistently go up by 0.2 hours? Pearson captures that proportional relationship.

Spearman rank correlation measures monotonic relationships — whether more of A generally means more of B, regardless of the exact proportions. It works by ranking both series (1st, 2nd, 3rd…) and computing Pearson on the ranks. This makes it robust to outliers and non-linear effects.

Why both? A single extreme outlier — one day where you walked 30,000 steps — can dramatically distort Pearson. Spearman barely notices because it only cares about rank order. When Pearson and Spearman agree, you can be more confident the relationship is real. When they disagree, it’s a signal that outliers or non-linearity may be influencing the result.
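A deliberately extreme illustration of that divergence (synthetic data; the variable names are ours, not Omnio's):

```python
from scipy.stats import pearsonr, spearmanr

# 20 days of essentially unrelated data: steps climb, deep sleep alternates.
steps = list(range(20))
deep_sleep = [i % 2 for i in range(20)]

# ...plus one extreme day that is large in both metrics.
steps_out = steps + [200]
sleep_out = deep_sleep + [50]

r_p, _ = pearsonr(steps_out, sleep_out)
r_s, _ = spearmanr(steps_out, sleep_out)
print(f"Pearson  = {r_p:.2f}")  # ~0.99: one outlier fabricates a "strong" linear fit
print(f"Spearman = {r_s:.2f}")  # ~0.23: rank order barely changes
```

When the two coefficients split this far apart, the honest conclusion is "one weird day," not "strong relationship" — which is exactly why the engine computes both.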

4. P-Values with False Discovery Rate Correction

Every correlation gets a p-value — the probability of seeing a correlation at least this strong by pure chance if the two metrics were actually unrelated.

But p-values have a well-known problem in bulk testing: if you test 28 metric pairs at p < 0.05, you’d expect about 1.4 false positives even if none of the relationships are real. Test 100 pairs and you’d expect 5 false discoveries.

We address this with Benjamini-Hochberg correction, which controls the false discovery rate — the expected proportion of false positives among results we flag as significant. After computing p-values for all pairs, we adjust them to account for the number of simultaneous tests. Correlations that don’t survive this correction are marked as “not significant after FDR” so you can see them but aren’t misled by them.
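The Benjamini-Hochberg adjustment itself is only a few lines. A self-contained sketch (equivalent in spirit to `statsmodels`' `multipletests(method="fdr_bh")`, though this is our illustration rather than Omnio's implementation):

```python
def benjamini_hochberg(pvals: list[float]) -> list[float]:
    """Return BH-adjusted p-values (q-values), in the original order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices, smallest p first
    adjusted = [0.0] * m
    prev = 1.0
    # Walk from the largest p-value down, scaling by m/rank and
    # enforcing that adjusted values never increase as p decreases.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        prev = min(prev, pvals[i] * m / rank)
        adjusted[i] = prev
    return adjusted

print(benjamini_hochberg([0.005, 0.04, 0.20, 0.03]))
# → [0.02, ~0.0533, 0.2, ~0.0533]: two raw p-values under 0.05 no longer clear it
```

A result is flagged as significant only if its adjusted value stays below the threshold — which is how a nominally significant p = 0.04 can end up marked "not significant after FDR."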

What This Looks Like in Practice

When you view correlations in Omnio, each pair shows:

  • Correlation coefficient — how strong the relationship is
  • Confidence badge — whether it’s statistically significant after accounting for multiple comparisons
  • Interpretation text — a plain-English description that includes caveats when relevant (e.g., “not statistically significant” or “small sample: n=16”)

Results are sorted with statistically significant correlations first. Pairs that failed the FDR correction are still visible — they may represent real patterns that need more data — but they’re clearly marked so you know the difference.

What We Don’t Do (and Why)

We don’t compute partial correlations. Partial correlation controls for specific confounding variables (e.g., “what’s the correlation between training volume and deep sleep, controlling for stress?”). It’s powerful but requires choosing which variables to control for — a decision that can dramatically change results and is easy to get wrong without domain expertise. Detrending handles the most common confounder (shared time trends) without requiring those choices.

We don’t use multivariate regression. Regression models that predict a metric from multiple inputs simultaneously are useful but require a different architecture and careful regularization to avoid overfitting. Our correlation engine is designed to surface pairwise relationships and flag which ones are likely real — a foundation that regression could build on in the future.

We don’t claim causation. A strong, statistically significant correlation between room temperature and sleep score means the two metrics co-vary in your data. It doesn’t prove that changing your thermostat will improve your sleep — though it might be worth trying. We’re careful to frame results as relationships, not mechanisms.

The Takeaway

Most health apps either don’t offer correlations at all, or show raw Pearson coefficients without any statistical qualification. We think that’s worse than showing nothing — it creates a false sense of insight.

Our approach prioritizes not misleading you over showing impressive-looking numbers. Every correlation you see in Omnio has been detrended, tested for significance, and corrected for the number of comparisons we ran. If a result shows up as significant, there’s a meaningful chance it reflects something real in your data.

And if we don’t have enough data to say anything yet, we’ll tell you that too.


Omnio aggregates health data from Oura Ring, Garmin, WHOOP, Apple Health, Google Health Connect, strength training logs, DEXA scans, bloodwork, nutrition tracking, and environment sensors into a unified analytics platform. Learn more at getomn.io.