Voice tech and AI: Is Detecting Diseases Based on 45 s of Voice Accurate? (Henry O'Connell)

 

Voice biomarkers: the promise is real—but only if we stop analyzing words

Ambient speech tech has exploded in healthcare—mostly as “documentation glue”: speech recognition, note-taking, clinical copilots. But what if the real opportunity isn’t the transcript at all?

In this episode of Faces of Digital Health, Henry O’Connell (co-founder of Canary Speech) argues that the last 25 years of “voice biomarker” progress stalled for a simple reason: the field focused on what people say (words) instead of how the nervous system produces speech (signal).

From words to the “primary data layer”

O’Connell traces the roots of the space through decades of speech and language work, from early NLP to Dragon NaturallySpeaking and far-field speech systems. The industry could find correlations by analyzing spoken words and their textual representations, but it never produced widely adopted clinical products. The limitation, he says, is data density: word-based analysis yields a few hundred elements at most, while the motor-control signals behind speech yield millions.

Canary Speech’s bet was to “turn it upside down”: ignore words and instead analyze the articulatory system—how the central nervous system coordinates breath, vocal cords, tongue, and timing to generate language. It’s the same human machinery regardless of language, and it’s the layer clinicians often “read” intuitively, like when you can hear someone isn’t okay even if they say “I’m fine.”

45 seconds, 13 million data elements, thousands of features

A key claim in the conversation is how little audio is needed: ~45 seconds of conversational speech captured opportunistically during a real clinician–patient exchange. Canary extracts 2,590 voice features every 10 milliseconds (a 25 ms analysis window advanced in 10 ms steps), which adds up to roughly 13 million data elements for the model to evaluate.
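For a sense of where a number like that comes from, here is a back-of-the-envelope sketch in Python using only the figures quoted in the episode (45 seconds of audio, a 25 ms window advanced in 10 ms steps, 2,590 features per frame). The exact framing Canary uses isn’t public, so treat this as illustrative arithmetic, not their pipeline.

```python
# Back-of-the-envelope count of frame-level data elements,
# using only the figures quoted in the episode.

AUDIO_SECONDS = 45          # conversational speech captured during the visit
WINDOW_MS = 25              # analysis window length
HOP_MS = 10                 # window advance: one frame every 10 ms
FEATURES_PER_FRAME = 2_590  # features extracted per frame

audio_ms = AUDIO_SECONDS * 1000
# Number of full 25 ms windows that fit when sliding by 10 ms.
num_frames = 1 + (audio_ms - WINDOW_MS) // HOP_MS

data_elements = num_frames * FEATURES_PER_FRAME
print(f"{num_frames:,} frames x {FEATURES_PER_FRAME:,} features "
      f"= {data_elements:,} data elements")
# 4,498 frames x 2,590 features ≈ 11.6 million, the same order of magnitude
# as the "roughly 13 million" cited in the conversation.
```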

Those features include characteristics tied to vocal cord vibration (e.g., jitter/shimmer), respiratory sounds, resonance, and how those properties change over time—plus derivatives (how fast features change and recover). Many of these features were already known in speech science; Canary focuses on stacking multiple extraction approaches and validating them clinically.
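To illustrate what “derivatives” of frame-level features can mean in practice, here is a minimal NumPy sketch that takes a matrix of per-frame measurements and appends first-order deltas (frame-to-frame rates of change). The simple differencing and the toy shapes are assumptions for illustration; nothing here reflects Canary’s actual feature stack.

```python
import numpy as np

def add_delta_features(frames: np.ndarray, hop_ms: float = 10.0) -> np.ndarray:
    """Append first-order deltas (rate of change per millisecond) to frame features.

    frames: shape (num_frames, num_features) -- e.g. per-frame jitter, shimmer,
            respiratory-band energy, resonance measures, one row every 10 ms.
    """
    # Difference each feature between consecutive frames; prepending the first
    # frame keeps the output the same length as the input.
    deltas = np.diff(frames, axis=0, prepend=frames[:1]) / hop_ms
    return np.hstack([frames, deltas])

# Toy example: 4,498 frames x 3 features becomes 4,498 x 6 after adding deltas.
toy = np.random.default_rng(0).random((4498, 3))
print(add_delta_features(toy).shape)  # (4498, 6)
```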


How the “ground truth” gets built

Canary partners with clinical teams (neurology, psychiatry, specialty services), designs IRB-reviewed protocols, captures patient audio during real clinical interviews, and uses clinician diagnoses and test batteries as labels—ground truth. Machine learning then learns correlations between the diagnostic labels and the extracted features.
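To make the “learn correlations between labels and features” step concrete, here is a minimal scikit-learn sketch: one pooled feature vector per recording, paired with a clinician-assigned label, fed to an off-the-shelf classifier with cross-validation. The pooling, the logistic-regression model, and the synthetic data are illustrative assumptions; the episode does not disclose Canary’s actual modeling approach.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# One row per recording: frame-level features pooled (e.g. averaged) into a
# fixed-length vector. Shapes and labels are synthetic placeholders.
X = rng.normal(size=(200, 2_590))   # 200 recordings x 2,590 pooled features
y = rng.integers(0, 2, size=200)    # clinician label: 0 = control, 1 = diagnosed

# Standardize features, then fit a simple linear classifier with cross-validation.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(model, X, y, cv=5)
print(f"cross-validated accuracy: {scores.mean():.2f}")
# On random data this hovers near chance; with real labeled recordings this is
# the basic loop for estimating and validating feature-label correlations.
```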

That clinical-first posture shows up repeatedly: Canary positions the output as clinical decision support, not a standalone diagnosis.

Accuracy: very high for progressive neuro, harder for behavioral health

O’Connell reports 98%+ accuracy in progressive neurological diseases like Huntington’s, Parkinson’s, and Alzheimer’s—while behavioral health (anxiety/depression) is typically “in the 80s,” reflecting the underlying variability of diagnoses and symptoms. A central point: the system can measure multiple dimensions at once—e.g., cognitive decline plus depression, fatigue, and anxiety—without extra time burden on the clinician.

Why adoption lagged—and why it’s moving now

If the results are this strong, why hasn’t this become routine? O’Connell points to earlier research norms that relied on structured read speech (scripts) borrowed from speech pathology. The brain regions involved in reading aloud differ from those involved in conversing, he argues, so script-based studies were counterproductive for building conversational biomarkers. Canary avoids read speech entirely.

He also credits recent AI and infrastructure shifts—real-time streaming, fast compute, better transcription and ambient tooling—as catalysts. Canary’s integration with ambient documentation workflows (e.g., a clinical copilot listening anyway) turns voice biomarkers into “no extra workflow” intelligence.

The ethics question: screening, incidental findings, and trust

The conversation goes into the “genetics-like” dilemma: what if you detect something you weren’t looking for? Canary’s framing is that voice biomarkers describe what’s present in the moment, not a future risk. That matters in primary care screening, where objective signals can prompt better follow-up without relying on subjective questionnaires alone.

One example O’Connell shares: a postpartum check where the patient says she’s fine, but Canary flags severe depression—creating space for the clinician to explore and address it sooner.

Beyond diagnosis: workforce safety and operational signals

The most unexpected use case may be in-room monitoring for aggression risk. In U.S. hospitals, violence against nurses is a serious operational and safety issue. Canary can contribute a “green/yellow/red” indicator based on the room environment to help teams enter situations more safely and de-escalate earlier—without targeting a specific person.
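As a rough illustration of what a “green/yellow/red” room indicator could look like in software, here is a tiny sketch that thresholds an environment-level risk score into a traffic-light state. The score source and the cutoffs are invented for illustration; they are not Canary’s.

```python
# Illustrative thresholding of a room-level (not per-person) risk score into
# a green/yellow/red indicator. Score source and cutoffs are invented here.

def room_indicator(risk_score: float,
                   yellow_at: float = 0.4,
                   red_at: float = 0.7) -> str:
    """Map a 0-1 aggression-risk score for the room to a traffic-light state."""
    if risk_score >= red_at:
        return "red"     # prompt staff to enter with support and de-escalate early
    if risk_score >= yellow_at:
        return "yellow"  # heightened awareness
    return "green"       # no elevated signal

print(room_indicator(0.82))  # -> red
```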

Remote monitoring and clinical trials: faster screening and outcome measurement

For pharma and trials, Canary can support:

  • Pre-screening participants (e.g., who is or isn’t demonstrating mild cognitive impairment)

  • Longitudinal tracking to detect changes over time (e.g., whether cognitive function improves with a therapy)

  • More direct outcome measures than proxies like plaque reduction alone

O’Connell hints at a pipeline of additional models (PTSD, postpartum depression, ADHD/autism, pain, COPD) and expanding language coverage with validation in each new language.

What’s next: global rollout and consumer wellness

Canary is already deploying in multiple regions (U.S., Canada, Japan, UK/Ireland, parts of Europe, and expanding in South America and Arabic-speaking markets). A consumer-facing wellness product is planned for 2026, O’Connell says, with the emphasis that it will still be built on clinically developed models rather than on data scraped from uncontrolled public sources.

Bottom line: This episode is a reminder that “voice in healthcare” isn’t only about transcription. If clinical validation holds at scale, voice biomarkers could become a quiet layer of decision support—always on, non-intrusive, and potentially transformative for early detection, behavioral health, and even workforce safety.



Timestamps (chapter-style)

00:12 Intro: voice biomarkers beyond ambient documentation
01:03 Potential vs reality: what voice can (and can’t yet) prove clinically
01:54 Why earlier voice-biomarker work focused on words—and why it stalled
06:34 The “intuition” problem: we hear mood without words
07:33 Canary’s clinical-first approach and global clinical partnerships
08:39 Do you need long-term data? (Agatha Christie example)
09:41 Method: ~45 seconds of conversational speech, ambient capture
10:36 Scale: 2,590 features every 10ms (~13M data elements)
11:02 Where the features come from (vocal cords, respiration, resonance)
14:36 How models are built: IRB, clinician ground truth, ML correlation
17:30 Accuracy + adoption: why it’s not standard practice yet
18:22 Reported performance: 98%+ neuro, ~80s behavioral health
22:44 Culture/language bias concern: why validation per language matters
24:22 Guardrails: validating every new language and population
31:07 Why read-speech scripts misled the field; conversational-only stance
34:16 What changed: AI, compute, real-time streaming, workflow fit
37:10 How it’s used: screening vs suspected disease; incidental findings
39:24 Primary care example: postpartum depression flagged despite “I’m fine”
47:56 Wellness/employee use: de-identified dashboards and ethics
53:42 Technical requirements: device capture, signal-to-noise checks
57:10 In-room monitoring: aggression risk signals for staff safety
1:02:22 Clinical trials: pre-screening + measuring therapeutic impact
1:07:14 Global rollout: regions, languages, and partnerships
1:09:42 Consumer access: wellness product planned in 2026
1:12:32 Wrap-up: why this matters as cognitive health needs grow
