Why Your Muse Data Needs Cleanup: Artifact Detection for Consumer EEG

Consumer EEG is noisy. Muscle activity, blinks, jaw clenches, and electrode liftoff produce spectral signatures that drown out the real brain signal. Here is how autoencoder-based artifact detection separates brain from noise.

Mac DeCourcy

You sit down for a 10-minute meditation session with a Muse headset. You breathe, you try to clear your mind, you catch yourself planning dinner, you return to the breath. The session ends and the app says your “calm time” was 74 percent. A friendly green badge appears.

What the app doesn’t say is that for the first three minutes your jaw was tense from the neighbor’s lawnmower, for 90 seconds in the middle your dog jumped on the couch and your head moved, and for the last two minutes one of the forehead electrodes was lifting as your skin dried. The 74 percent was computed from an EEG signal that was at least 40 percent muscle activity and motion, not brain activity. The number is plausible. It is also largely noise.

This is the artifact problem for consumer EEG, and it’s the least-visible piece of the “composite scores with confidence” pillar. If a score is built on an EEG signal, and the signal is mostly artifact, the score is mostly artifact. The only thing separating a useful meditation session from a useless one is whether the platform detected the artifacts and propagated that detection into the score’s confidence.


Why Consumer EEG Is Hard

Clinical EEG typically uses 19 or more gel-based electrodes (research rigs often use 32 to 256), a grounded amplifier, and sometimes an electrically shielded room. Consumer EEG uses four dry electrodes, battery power, and whatever room you happen to be in. The signal-to-noise ratio can be an order of magnitude worse, and consumer devices compensate with aggressive on-device filtering and pretty UI.

The physics haven’t changed, though. The raw voltage traces off a forehead electrode are dominated by non-brain sources:

EMG (electromyography): muscle activity. The frontalis muscle (forehead) contracts with facial expressions, jaw tension, and squinting. The temporalis (side of head) contracts with chewing and jaw clench. The neck muscles stabilize head position. All of these produce voltages an order of magnitude larger than cortical activity, and their spectrum overlaps heavily with the beta band (13 to 30 Hz), which is exactly the band most “focus” and “attention” scores weight.

EOG (electrooculography): eye movement. Every blink produces a large dipole because the cornea is positively charged relative to the retina. Forehead electrodes pick up blinks as clean, stereotyped pulses. Slower eye movements produce slower drifts. A typical 10-minute session might have 150 to 250 blink events, each contaminating a window of roughly 500 ms. That adds up to 75 to 125 seconds, or about 12 to 21 percent of the session.

Motion artifacts. Dry electrodes move relative to the skin whenever the head moves. The resulting impedance shift produces a transient that looks like a low-frequency oscillation. Anything above “sitting very still” generates motion artifacts. For users who meditate while moving (walking, fidgeting), this is a structural issue.

Electrode liftoff. Over a long session, skin dries, oils accumulate, sweat changes impedance. A sensor that was in good contact at minute 0 can drift out of contact by minute 15. The artifact that results is usually intermittent and easy to mistake for a real signal change.

Line noise. Mains interference at 50 or 60 Hz (depending on region) couples in from nearby wiring and electronics. Notch filters help, but harmonics and sidebands slip through.

A consumer EEG stream is a mix of all of these plus the actual cortical signal you’re trying to measure. The cortical signal is typically the smallest contribution. Platforms that don’t aggressively handle this produce scores that reflect how still the user was and how dry the electrodes stayed — not what their brain was doing.

Why Traditional Filters Aren’t Enough

The first thing any EEG pipeline does is bandpass and notch filter. These steps are essential and computationally almost free. A bandpass of 1 to 40 Hz removes DC drift and high-frequency noise. A notch at 50 or 60 Hz kills line noise. Good consumer devices have these baked in.
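As a rough illustration of the notch step, here is a minimal sketch of a second-order (biquad) notch filter in plain Python, using the widely published RBJ audio-EQ cookbook coefficients. The 256 Hz sampling rate, the Q of 30, and the synthetic sine waves are assumptions chosen for the example, not values taken from any particular headset's firmware.

```python
import math

def notch_coefficients(f0, fs, q=30.0):
    """Biquad notch coefficients (RBJ cookbook form), normalized so a[0] == 1."""
    w0 = 2.0 * math.pi * f0 / fs
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha
    b = [1.0 / a0, -2.0 * math.cos(w0) / a0, 1.0 / a0]
    a = [1.0, -2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0]
    return b, a

def apply_biquad(samples, b, a):
    """Direct-form I: y[n] = b0*x[n] + b1*x[n-1] + b2*x[n-2] - a1*y[n-1] - a2*y[n-2]."""
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for x0 in samples:
        y0 = b[0] * x0 + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        out.append(y0)
        x2, x1 = x1, x0
        y2, y1 = y1, y0
    return out

def rms(samples):
    return math.sqrt(sum(s * s for s in samples) / len(samples))

FS = 256  # assumed sampling rate in Hz
t = [n / FS for n in range(4 * FS)]
alpha_wave = [math.sin(2 * math.pi * 10 * ti) for ti in t]   # stand-in for a 10 Hz rhythm
line_noise = [math.sin(2 * math.pi * 60 * ti) for ti in t]   # stand-in for 60 Hz mains

b, a = notch_coefficients(60.0, FS)
alpha_out = apply_biquad(alpha_wave, b, a)
line_out = apply_biquad(line_noise, b, a)

# Compare RMS over the last two seconds, after the filter's start-up transient.
print("10 Hz RMS before/after:", rms(alpha_wave[2 * FS:]), rms(alpha_out[2 * FS:]))
print("60 Hz RMS before/after:", rms(line_noise[2 * FS:]), rms(line_out[2 * FS:]))
```

The 60 Hz tone is removed while the 10 Hz tone passes nearly untouched, which is exactly why this kind of filter works for line noise and fails for artifacts that share a band with the signal.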

But filters can’t solve the hard cases, because the artifacts overlap with the signal:

  • EMG overlaps with beta. Filtering 13 to 30 Hz out would destroy the beta band itself. No frequency-selective filter on a single channel removes frontalis EMG while preserving frontal beta.
  • Blinks are broadband. A blink has energy across most of the spectrum, including alpha (8 to 13 Hz) and theta (4 to 8 Hz), the bands most meditation scores care about. A bandpass filter passes the blink contamination along with the real signal.
  • Motion is slow but large. Motion artifacts are often in the 0.5 to 4 Hz range, overlapping with delta. Delta is part of sleep staging and some focus metrics.

Clinical EEG handles this with independent component analysis (ICA), a blind source separation technique that splits multi-channel recordings into components that are likely independent sources, then lets the analyst manually or automatically identify and remove artifact components. ICA works. It can also recover at most as many components as there are channels, so it usually needs 8 or more to separate artifacts cleanly, and a headset like the Muse has 4. ICA is a clinical-grade tool that doesn’t port cleanly to consumer hardware.

This is where learned models come in. If you can’t separate sources analytically, you can learn what “clean” looks like and flag anything that isn’t.

The PSD Autoencoder Approach

An autoencoder is a neural network trained to compress an input and reconstruct it. The middle layer (the “bottleneck”) is a lower-dimensional representation of the input. Training consists of minimizing the difference between the original input and its reconstruction.

The key property: an autoencoder trained only on one class of data — say, clean EEG — learns to reconstruct that class well. When fed data from a different class — contaminated EEG — it reconstructs it poorly. The reconstruction error becomes a proxy for “how much does this input look like the training distribution?”

For EEG artifact detection, the approach looks like this:

  1. Represent each window as a power spectral density (PSD). A 2 to 4 second window of raw EEG gets transformed into a frequency-domain representation — power across 1 Hz bins from 1 to 40 Hz. This throws out phase information but preserves the spectral fingerprint, which is what artifact types differ in. A blink has a different PSD than normal alpha rhythm. Jaw clench has a different PSD than frontal beta. The PSD is a compact representation that captures the artifact / signal distinction well.

  2. Train the autoencoder on clean windows only. You need a curated set of windows known to be clean — eyes closed, still, good impedance, no EMG contamination. This is the hard part, because labeling clean windows requires either an expert or careful session protocols. Once you have a few thousand clean windows, the autoencoder learns to reconstruct that distribution faithfully.

  3. At inference, score each new window by its reconstruction error. Low error (the input looks like training data) means the window is probably clean. High error means the spectrum is unfamiliar, which correlates strongly with the presence of one or more artifact types. The error is a single scalar per window, which makes it easy to aggregate over a session.

  4. Set thresholds and propagate. A threshold on reconstruction error converts the continuous score into a binary clean/artifact label, or a graded quality label (“clean” / “mild contamination” / “severe contamination”). The fraction of clean windows in the session becomes a quality metric. The composite score — focus time, calm time, whatever — gets computed only on clean windows, and the overall session’s confidence reflects the fraction that was usable.
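Step 1 above can be sketched in a few lines. This is a minimal, dependency-free illustration that evaluates a direct DFT only at the integer frequencies of interest; a production pipeline would more likely use an FFT or Welch's method. The 256 Hz sampling rate, the function names, and the synthetic test signal are assumptions made for the example.

```python
import math

def psd_1hz_bins(window, fs, f_min=1, f_max=40):
    """Power at integer frequencies f_min..f_max Hz for one Hann-tapered window.

    The result is correct up to a constant scale factor, which is enough for a
    model that only compares spectral shapes.
    """
    n = len(window)
    mean = sum(window) / n
    # Remove the mean, then apply a Hann taper to reduce leakage from window edges.
    tapered = [(x - mean) * (0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)))
               for i, x in enumerate(window)]
    powers = []
    for f in range(f_min, f_max + 1):
        re = sum(x * math.cos(2 * math.pi * f * i / fs) for i, x in enumerate(tapered))
        im = sum(x * math.sin(2 * math.pi * f * i / fs) for i, x in enumerate(tapered))
        powers.append((re * re + im * im) / n)
    return powers

def split_windows(samples, fs, seconds=2.0):
    """Non-overlapping windows; a trailing partial window is dropped."""
    size = int(fs * seconds)
    return [samples[i:i + size] for i in range(0, len(samples) - size + 1, size)]

FS = 256  # assumed sampling rate in Hz
# Synthetic six-second trace: a strong 10 Hz rhythm plus a weaker 25 Hz component.
signal = [math.sin(2 * math.pi * 10 * n / FS) + 0.2 * math.sin(2 * math.pi * 25 * n / FS)
          for n in range(6 * FS)]

windows = split_windows(signal, FS)
psds = [psd_1hz_bins(w, FS) for w in windows]
peak_hz = 1 + psds[0].index(max(psds[0]))  # bin 0 corresponds to 1 Hz
print(len(windows), "windows,", len(psds[0]), "bins each, peak at", peak_hz, "Hz")
```

Each window becomes a 40-element vector, which is the input the autoencoder in the following steps would see.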

This architecture has a number of useful properties. It doesn’t require labeled artifact examples — the clean set is all you need, and “artifact” is defined implicitly as “things that don’t look clean.” It generalizes across artifact types, because the autoencoder reconstructs only what it was trained on; any deviation (EMG, blink, motion, liftoff) triggers higher reconstruction error. It’s fast enough to run in real time on phone-class hardware.
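To make the reconstruction-error idea concrete, here is a deliberately tiny sketch: a linear autoencoder with a single bottleneck unit and tied encoder/decoder weights, trained with plain SGD on synthetic “clean” spectra. Everything here is an assumption for illustration: the 8-bin toy spectra (instead of 40 bins), the template shape, the learning rate, and the EMG-like test vector. A real model would be a small nonlinear network trained on curated recordings.

```python
import random

def train_tied_autoencoder(clean, lr=0.05, epochs=30):
    """SGD on a one-unit, tied-weight linear autoencoder.

    encode: s = w . x        decode: x_hat = w * s
    """
    d = len(clean[0])
    w = [0.1] * d
    for _ in range(epochs):
        for x in clean:
            s = sum(wi * xi for wi, xi in zip(w, x))       # bottleneck activation
            r = [xi - wi * s for xi, wi in zip(x, w)]      # reconstruction residual
            wr = sum(wi * ri for wi, ri in zip(w, r))
            # Gradient descent step on the loss ||x - w (w . x)||^2.
            w = [wi + lr * 2.0 * (s * ri + wr * xi) for wi, ri, xi in zip(w, r, x)]
    return w

def reconstruction_error(w, x):
    """Mean squared error between x and its reconstruction."""
    s = sum(wi * xi for wi, xi in zip(w, x))
    return sum((xi - wi * s) ** 2 for xi, wi in zip(x, w)) / len(x)

random.seed(0)
# Hypothetical "clean" spectral shape with a peak in the third bin.
TEMPLATE = [0.2, 0.4, 0.8, 0.6, 0.3, 0.2, 0.16, 0.12]

def synthetic_clean():
    gain = random.uniform(0.8, 1.2)  # overall amplitude varies between windows
    return [gain * t + random.gauss(0.0, 0.02) for t in TEMPLATE]

clean_train = [synthetic_clean() for _ in range(200)]
w = train_tied_autoencoder(clean_train)

clean_test = [synthetic_clean() for _ in range(50)]
emg_like = [0.1, 0.1, 0.1, 0.2, 0.5, 0.8, 0.9, 0.9]  # power piled into the high bins

clean_errors = [reconstruction_error(w, x) for x in clean_test]
artifact_error = reconstruction_error(w, emg_like)
print("worst clean error:", max(clean_errors))
print("artifact error:   ", artifact_error)
```

The model never saw an artifact during training, yet the EMG-like spectrum reconstructs far worse than any held-out clean window. That gap is the signal the thresholds operate on.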

What It Looks Like in Practice

A 10-minute meditation session produces 300 non-overlapping 2-second windows (150 if the windows are 4 seconds long). For a user with good electrode contact sitting still, a typical distribution of reconstruction errors might be:

  • 60 to 70 percent of windows at low error — clean enough to compute calm time on
  • 20 to 30 percent at moderate error — usually blinks or brief motion, borderline acceptable
  • 5 to 10 percent at high error — clear artifacts, excluded

For a user with jaw tension throughout or sliding electrodes, the distribution shifts badly:

  • 20 to 40 percent clean
  • 30 to 50 percent moderate
  • 20 to 40 percent severe artifact

A naive pipeline would compute the same “calm time” number from both distributions, treating all windows equally. An artifact-aware pipeline discounts the second session’s score, widens its confidence interval, and surfaces to the user why. Something like: “Your session was 35 percent clean, which limits what we can say about your state. Jaw tension or electrode slip likely contributed. Next session, try relaxing the jaw and re-seating the device.”

That message does more for the user’s understanding than a precise-but-wrong 74 percent ever could.
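The difference between the naive and artifact-aware pipelines can be sketched as follows. The way thresholds are derived here (95th percentile of held-out clean errors for “mild”, three times that for “severe”) is an assumption for the example, as are the function names and the made-up error values. The article does not prescribe a specific rule.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct in 0..100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def grade_windows(errors, mild_threshold, severe_threshold):
    """Turn per-window reconstruction errors into graded quality labels."""
    labels = []
    for e in errors:
        if e <= mild_threshold:
            labels.append("clean")
        elif e <= severe_threshold:
            labels.append("mild")
        else:
            labels.append("severe")
    return labels

def gated_calm_fraction(calm_flags, labels):
    """Calm fraction over clean windows only, plus the fraction of clean windows."""
    clean_flags = [c for c, lab in zip(calm_flags, labels) if lab == "clean"]
    clean_fraction = len(clean_flags) / len(labels)
    calm = sum(clean_flags) / len(clean_flags) if clean_flags else None
    return calm, clean_fraction

# Made-up reconstruction errors (arbitrary units) from held-out clean windows.
held_out_clean = [0.8, 1.0, 1.1, 0.9, 1.2, 1.0, 0.7, 1.3, 1.1, 0.9]
mild_threshold = percentile(held_out_clean, 95)
severe_threshold = 3 * mild_threshold

# One short session: per-window errors and a per-window "calm" decision.
session_errors = [0.9, 1.0, 5.2, 1.1, 2.0, 0.8, 7.5, 1.2, 1.0, 3.0]
calm_flags = [True, True, True, False, True, True, True, True, False, True]

labels = grade_windows(session_errors, mild_threshold, severe_threshold)
naive_calm = sum(calm_flags) / len(calm_flags)
gated_calm, clean_fraction = gated_calm_fraction(calm_flags, labels)
print("naive calm:", naive_calm)
print("gated calm:", gated_calm, "from", clean_fraction, "of windows")
```

In this toy session the naive number counts contaminated windows as calm, while the gated number comes with an explicit statement of how much of the session it is based on.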

On-Device Inference with ONNX

Sending the raw EEG stream to a server for artifact detection has both latency and privacy costs. Fortunately, a PSD autoencoder is small enough to run on-device.

The deployment pattern:

  1. Train in the cloud. Collect the clean dataset. Train the autoencoder on a fast GPU. Tune the architecture — typically a small fully-connected or 1D conv model, a few hundred thousand parameters at most.

  2. Quantize and export to ONNX. ONNX is an open, runtime-neutral model format. A model with a few hundred thousand parameters is on the order of 1 MB in float32 and smaller still after quantization (int8 or float16). The accuracy loss is usually small; a well-designed PSD autoencoder tends to be robust to quantization.

  3. Ship the ONNX model with the app. On phone hardware, ONNX Runtime Mobile or a similar engine loads the model and runs inference on each 2-second window in a few milliseconds. The headset feeds raw samples via Bluetooth; the phone computes PSDs, runs the autoencoder, and applies the threshold. The user’s raw EEG never leaves the phone.

  4. Send only the aggregate artifact rate up to the server. The server-side composite score machinery only needs to know “how clean was this session” for confidence purposes. It doesn’t need the raw stream or even the per-window reconstruction errors. Bandwidth drops by orders of magnitude.

This is the same architecture pattern that applies to on-device ML in health in general — train centralized, deploy quantized, keep the sensitive stream local. For EEG specifically, the privacy gain is real: raw brain signals are intimate in a way that most other biometric data isn’t.
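What “send only the aggregate” might look like on the phone side is sketched below. The `SessionSummary` type, its field names, and the JSON encoding are hypothetical, invented for this example; the raw-sample arithmetic assumes 4 channels at 256 Hz over a 10-minute session.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class SessionSummary:
    """Hypothetical upload payload: aggregates only, no samples or per-window errors."""
    session_id: str
    window_seconds: float
    n_windows: int
    clean_fraction: float
    mild_fraction: float
    severe_fraction: float

def summarize_on_device(session_id, labels, window_seconds=2.0):
    """Reduce per-window quality labels to the aggregate that leaves the phone."""
    n = len(labels)
    if n == 0:
        raise ValueError("session has no windows")
    return SessionSummary(
        session_id=session_id,
        window_seconds=window_seconds,
        n_windows=n,
        clean_fraction=labels.count("clean") / n,
        mild_fraction=labels.count("mild") / n,
        severe_fraction=labels.count("severe") / n,
    )

# 300 two-second windows from a reasonably clean 10-minute session.
labels = ["clean"] * 195 + ["mild"] * 75 + ["severe"] * 30
summary = summarize_on_device("session-001", labels)
payload = json.dumps(asdict(summary))

# For scale: raw samples in the same session, assuming 4 channels at 256 Hz.
raw_sample_count = 4 * 256 * 600
print(len(payload), "bytes of JSON versus", raw_sample_count, "raw samples")
```

The payload is a couple of hundred bytes regardless of session length, and nothing in it can be used to reconstruct the underlying signal.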

Integration with the Composite Score

Artifact detection is only useful if it feeds the score. A platform can have the best autoencoder in the world and still produce misleading numbers if the artifact rate doesn’t flow through to the user-facing dashboard.

The integration points:

  • Per-window gating. Calm time / focus time / other session-level metrics are computed only on clean windows. A 10-minute session with 3 minutes of artifact becomes a 7-minute measurement, not a 10-minute average.
  • Session confidence. The aggregate artifact rate becomes a session quality score between 0 and 1, which enters the confidence calculation for any daily or weekly composite that includes mindfulness / cognitive load / focus inputs.
  • User-visible quality indicator. The dashboard shows a small badge per session — “high quality,” “mixed quality,” “mostly unusable” — so the user can build intuition about what makes sessions clean and what makes them noisy.
  • Session suppression. Sessions below a quality floor (say, less than 30 percent clean windows) don’t get a rendered score at all. They show up in the history as “too many artifacts — try again in a quieter position.”
  • Trend-level confidence. Weekly or monthly meditation trends are weighted by session quality, so a week of mostly-clean sessions counts more than a week of mostly-noisy ones.

This is the same structural logic as source fallback chains and cross-source validation — each input carries its own quality signal, and the composite inherits the variance. The difference is that for EEG, the quality signal comes from a learned model rather than a tolerance threshold, because there’s no external reference to compare against.
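A minimal sketch of how those integration points might fit together is shown below. The 30 percent floor is the article's own example value; the 0.60 cutoff for the “high quality” badge, the function names, and the use of clean fraction directly as a trend weight are assumptions made for illustration.

```python
QUALITY_FLOOR = 0.30  # the article's example floor for rendering a score

def session_badge(clean_fraction):
    """Map clean fraction to a user-visible badge (0.60 cutoff is illustrative)."""
    if clean_fraction >= 0.60:
        return "high quality"
    if clean_fraction >= QUALITY_FLOOR:
        return "mixed quality"
    return "mostly unusable"

def rendered_score(score, clean_fraction):
    """Return the score, or None when the session falls below the floor."""
    return score if clean_fraction >= QUALITY_FLOOR else None

def quality_weighted_trend(sessions):
    """Weighted mean of (score, clean_fraction) pairs, skipping suppressed sessions."""
    usable = [(s, q) for s, q in sessions if q >= QUALITY_FLOOR]
    total = sum(q for _, q in usable)
    if total == 0:
        return None
    return sum(s * q for s, q in usable) / total

# Three sessions in a week: (calm score on clean windows, fraction of clean windows).
week = [(0.74, 0.22), (0.60, 0.80), (0.50, 0.40)]

badges = [session_badge(q) for _, q in week]
scores = [rendered_score(s, q) for s, q in week]
trend = quality_weighted_trend(week)
naive_trend = sum(s for s, _ in week) / len(week)
print(badges)
print(scores)
print("weighted trend:", trend, "naive trend:", naive_trend)
```

The noisy 0.74 session is suppressed rather than averaged in, and the cleanest session dominates the weekly trend.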

Why Most Consumer EEG Platforms Skip This

Building the pipeline described above takes meaningful engineering. You need the dataset-collection infrastructure to curate clean windows, the training stack, the deployment pipeline, and the dashboard plumbing to propagate quality into scores. It’s not a weekend project.

The shortcut is to compute scores on raw data and present a clean-looking number. This is what most consumer EEG apps ship. The user can’t tell the difference on any individual session — every session produces a number, and the numbers feel “about right” most of the time.

The cost is that the scores drift away from actual cognitive state whenever the conditions are imperfect. Jaw tension, head position, room temperature, hydration, and electrode age all shift the artifact profile. A platform without artifact-aware scoring silently rewards the user for sitting still and wearing the headset correctly — which is a valid proxy for “favorable conditions for meditation” but is not a measurement of whether they actually meditated.

This matters more as EEG gets built into larger composite scores. A readiness or recovery model that includes a cognitive-state input from a Muse session with a 60 percent artifact rate is propagating noise into a number that gets used to make training decisions. The right answer isn’t “don’t include EEG inputs.” It’s to include them with their quality score attached and let the composite’s confidence reflect the reality.

For the broader framing of how this fits in composite scoring, see the pillar on composite scores with confidence and the feature page on composite health scores. For the adjacent pieces of the quality-aware stack, see readiness score confidence intervals, source fallback chains, and cross-source validation. For how per-user thresholds apply here (a meditator’s own clean-session baseline vs a population norm), see personalized thresholds vs population norms. For a broader look at how wearables measure physiological signals, the primer is “What is HRV and how do wearables measure it.”


Putting It Together

Consumer EEG is a noisy medium wearing a clean UI. The raw signal from a Muse or Flowtime headset contains meaningful cortical activity, along with much larger contributions from muscles, eyes, and skin contact. Without a detector designed to catch the non-brain contributions, the derived scores are measuring how well the user sat still, not how well they meditated.

The essentials of doing it right:

  • Represent each window as a PSD and train an autoencoder on curated clean windows
  • Use reconstruction error as a continuous artifact score per window
  • Apply per-window thresholds to gate which windows enter the session score
  • Aggregate to a session-level quality metric and propagate to the composite’s confidence
  • Deploy the model on-device via ONNX so raw EEG stays local
  • Render quality indicators in the UI and suppress sessions whose quality is below a floor

None of this requires ML research. The autoencoder approach has been standard in anomaly detection for years. What it requires is the engineering discipline to treat input quality as a first-class concern rather than a background detail. A meditation score built on unfiltered EEG is confident-looking guesswork. A meditation score built on artifact-aware EEG is a measurement, and the confidence value attached to it is the difference.