technology & research

Healthcare is a long-horizon problem. Most AI is built for the short one.

A symptom mentioned today can matter in six weeks, or six years. A belief surfaced once should shape a conversation a quarter later. Health is a continuous, multi-provenance record that only means something in context, and that is precisely the regime where today's default AI approaches are weakest. This page is for practitioners who want more than a buzzword before they believe a claim.

part a — the ai problem

Long-horizon recall

Why health breaks the default approach

The intuitive build is to connect a data source, upload the documents, and let a frontier model read the whole history on every turn. It works for a single session and then degrades — not because the models are weak, but because of a well-documented property of how they use long contexts.

As the context window fills, recall precision and long-range reasoning decline. This is the "lost in the middle" effect (Liu et al., TACL 2024): accuracy is highest for material at the very start or end of the context and sags in the middle, even in models explicitly built for long context. Later benchmarks sharpen the point for realistic, non-keyword recall, the kind that longitudinal health reasoning actually requires. On NoLiMa (ICML 2025), which removes lexical shortcuts and forces latent association, eleven of thirteen leading long-context models fell below half their short-context accuracy at just 32K tokens. RULER (NVIDIA, 2024) finds the same gap between advertised and effective context length: the models it tested all claim 32K tokens or more, yet only about half held satisfactory performance at 32K. Anthropic frames the honest shape of it as "a performance gradient rather than a hard cliff" (capable at length, but with reduced precision for retrieval and long-range reasoning) and prescribes the alternative directly: distill and structure context rather than hoard it.

Naïve retrieval-augmented generation doesn't escape this on its own, either. Health data is messy, multi-source, and association-heavy — low on the literal keyword overlap that vanilla vector retrieval leans on — so a retrieve-then-summarize pipeline tuned for clean documents is brittle here. Stuffing more context back in only re-introduces the degradation above.

You cannot solve a person's health by pasting their life into a prompt.

The shape of our approach

We'll describe the silhouette, not the blueprint. Wubs maintains a domain-specific memory of the person, separated by kind — so the system reasons over understanding rather than transcripts. That separation is what lets it distill instead of accumulate.

Factual spine: vitals · labs · meds · notes, each with provenance + time

Sources: FHIR · scan · chat

Motivational layer (COM-B): capability · opportunity · motivation · beliefs

Working plan: goals → the behaviors each requires → what each needs

Pattern library: which behavioral-science move fits which situation

🌙 Agentic distillation: a slower-clock pass that decides what's worth remembering, reconciles it, and updates the model

⚡ Retrieval by meaning: live, surfacing only what's relevant, weighted by confidence, recency, and consequence

Kind-separated memory, distilled on a slower clock and retrieved by meaning — not a transcript poured into a prompt.
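To make "separated by kind" concrete, here is a minimal sketch of such a store. It is an illustration under our own assumptions, not Wubs's schema: the names `Kind`, `MemoryItem`, and `PersonMemory`, the field names, and the default confidence are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class Kind(Enum):
    """The four kinds of memory named in the diagram (labels are illustrative)."""
    FACT = "factual_spine"
    MOTIVATION = "motivational_layer"
    PLAN = "working_plan"
    PATTERN = "pattern_library"


@dataclass(frozen=True)
class MemoryItem:
    kind: Kind
    content: str
    source: str             # provenance, e.g. "fhir", "scan", "chat"
    observed_at: datetime   # when the item was observed, not when it was stored
    confidence: float = 0.5 # assumed default: unconfirmed items start uncertain
    confirmed: bool = False # True once the person has confirmed it


@dataclass
class PersonMemory:
    # One list per kind, so each kind can be read and updated independently.
    items: dict[Kind, list[MemoryItem]] = field(
        default_factory=lambda: {k: [] for k in Kind}
    )

    def add(self, item: MemoryItem) -> None:
        self.items[item.kind].append(item)

    def of_kind(self, kind: Kind) -> list[MemoryItem]:
        """Return one kind's items in chronological order of observation."""
        return sorted(self.items[kind], key=lambda i: i.observed_at)


memory = PersonMemory()
memory.add(MemoryItem(Kind.FACT, "HbA1c 6.9%", "fhir",
                      datetime(2024, 6, 1, tzinfo=timezone.utc), 0.95, True))
memory.add(MemoryItem(Kind.MOTIVATION, "doubts the statin is needed", "chat",
                      datetime(2024, 6, 3, tzinfo=timezone.utc)))
```

Keeping a lab value and a belief in different lists is the point of the sketch: each carries its own provenance and timestamp, and neither has to be re-derived from a transcript.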

Agentic distillation, on a slower clock. A recurring offline pass works the day's dialogue and incoming data in multiple reasoning steps — deciding what here is worth remembering, reconciling it against what's already known, and updating the model — rather than making a single summarization call.
Retrieval by meaning, not by dump. At interaction time, the system assembles only the items genuinely relevant to the moment, weighted by confidence, recency, and consequence, at bounded cost — rather than re-reading everything and inheriting the degradation curve.
Provenance and calibrated confidence on every fact. The system is deliberately biased toward holding an inference as tentative until the person confirms it, and toward saying "I don't have strong coverage here" instead of bluffing.
Evaluation as a control loop. Model-driven surfaces are scored by a deliberately skeptical harness against domain standards; changes are driven by measured quality thresholds, not intuition or schedule.
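The retrieval step in the list above can be sketched as a scoring function plus a bounded budget. This is a hedged illustration only: the multiplicative formula, the 90-day half-life, the character budget, and the names `Candidate`, `score`, and `assemble` are assumptions made for the example, not the production weighting.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    text: str
    relevance: float    # similarity in meaning to the current moment, 0..1
    confidence: float   # how sure the system is of the item, 0..1
    age_days: float     # time since the item was observed
    consequence: float  # cost of missing it, 0..1


def score(c: Candidate, half_life_days: float = 90.0) -> float:
    """Combine relevance, confidence, recency, and consequence into one weight."""
    recency = 0.5 ** (c.age_days / half_life_days)
    # Consequence acts as a floor on recency decay, so an old but serious
    # item (an allergy, say) is not aged out of consideration.
    freshness = max(recency, c.consequence)
    return c.relevance * c.confidence * freshness


def assemble(candidates: list[Candidate], budget_chars: int = 200) -> list[Candidate]:
    """Greedily take the highest-scoring items that still fit the budget."""
    chosen: list[Candidate] = []
    used = 0
    for c in sorted(candidates, key=score, reverse=True):
        if used + len(c.text) <= budget_chars:
            chosen.append(c)
            used += len(c.text)
    return chosen
```

The budget is what keeps the cost bounded: the context handed to the model grows with what matters now, not with the length of the person's history.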

The defensible composite, in practitioner terms: agentic memory distillation into a structured, kind-separated store, with meaning-based retrieval-time context assembly and confidence-weighted provenance — engineered for the longitudinal, multi-provenance reality of health rather than retrofitted from a single-session chatbot. The conversation on top is replaceable. The disciplined model of the person underneath is the hard part.
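One way to picture the tentative-until-confirmed bias is a reconcile step run during distillation. The sketch below is an assumption-laden illustration: the `Belief` type, the `reconcile` function, and the 0.6 confidence cap are hypothetical, and a real distillation pass would involve multiple reasoning steps rather than a single rule.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Belief:
    key: str
    value: str
    confidence: float
    confirmed: bool = False  # True only once the person has confirmed it


def reconcile(store: dict[str, Belief], new: Belief, cap: float = 0.6) -> dict[str, Belief]:
    """Return an updated copy of the store after considering one new belief."""
    if not new.confirmed:
        # Inferences stay tentative: clamp their confidence until confirmed.
        new = replace(new, confidence=min(new.confidence, cap))
    old = store.get(new.key)
    # A confirmed belief is never overwritten by an unconfirmed inference.
    if old is not None and old.confirmed and not new.confirmed:
        return dict(store)
    # Otherwise accept the new belief if it is confirmed, fills a gap,
    # or is at least as confident as the one already held.
    if old is None or new.confirmed or new.confidence >= old.confidence:
        return {**store, new.key: new}
    return dict(store)


store = reconcile({}, Belief("sleep", "poor", 0.9))                     # held as tentative
store = reconcile(store, Belief("sleep", "fine", 1.0, confirmed=True))  # person confirms
store = reconcile(store, Belief("sleep", "poor", 0.99))                 # inference ignored
```

However confident an inference looks, it cannot displace what the person has said about themselves; only another confirmation can.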

part b — the behavioral science

Grounded in the canon

A coach that's clinically aware but behaviorally naïve is just a friendlier reminder. Wubs is built on the working theory of why people do — and don't do — health behaviors.

COM-B and the Behaviour Change Wheel (Michie, van Stralen & West, 2011): behavior as the product of Capability, Opportunity, and Motivation, with a behavioral diagnosis preceding any intervention. The reason "just remind them" underperforms: reminders address only the forgetting component, while most non-adherence is rooted elsewhere. Even self-reported "forgetting" is frequently a proxy for low perceived need, cost, or competing priority (Gadkari & McHorney, 2012).
Behavior specification (AACTT; Presseau et al., 2019): making a target behavior concrete along Action, Actor, Context, Target, and Time, rather than coaching a vague aspiration.
Self-Determination Theory and the self-concordance model (Ng et al., 2012; Sheldon & Elliot, 1999): behaviors driven by autonomous, internalized motivation persist; those driven by external pressure decay as soon as the pressure does. This is the empirical case for co-design over compliance.
The Necessity-Concerns Framework (Horne et al., 2013): adherence tracks a patient's beliefs about a treatment's necessity weighed against their concerns, and those beliefs out-predict clinical and sociodemographic factors.
Patient activation (Hibbard; Greene et al., 2015): higher activation is associated with better outcomes and lower cost. (Associative, not a causal guarantee; we cite it as it is.)

COM-B: a behavioral diagnosis precedes any intervention.

How we operationalize it

Wubs runs a behavioral diagnosis per person and per behavior — which of capability, opportunity, or motivation is the live barrier here — and selects its approach accordingly, rather than applying a generic nudge. It co-designs for autonomy, putting the patient in authorship of the how, because that is what the motivation literature says endures. And it does not author clinical content: it maps behavior — goals, the behaviors each requires, and the conditions each behavior needs — over whatever plan the patient's clinicians have set, with safety guardrails that disengage from unsafe territory.
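The diagnose-then-select step can be sketched in a few lines. Everything here is illustrative: the 0-to-1 component scores, the lowest-score rule, the `APPROACH` table, and the names `Barrier`, `Behaviour`, `diagnose`, and `select_approach` are hypothetical stand-ins, not the actual diagnostic logic or pattern library.

```python
from dataclasses import dataclass
from enum import Enum


class Barrier(Enum):
    CAPABILITY = "capability"
    OPPORTUNITY = "opportunity"
    MOTIVATION = "motivation"


@dataclass(frozen=True)
class Behaviour:
    """A target behavior specified along the five AACTT dimensions."""
    action: str
    actor: str
    context: str
    target: str
    time: str


# Hypothetical mapping from the live barrier to a family of approaches.
APPROACH = {
    Barrier.CAPABILITY: "build the skill or knowledge",
    Barrier.OPPORTUNITY: "restructure the environment or routine",
    Barrier.MOTIVATION: "explore beliefs and co-design the how",
}


def diagnose(scores: dict[Barrier, float]) -> Barrier:
    """Treat the weakest COM-B component (lowest score) as the live barrier."""
    return min(scores, key=scores.get)


def select_approach(behaviour: Behaviour, scores: dict[Barrier, float]) -> str:
    """Pick the approach for this person and this behavior, not a generic nudge."""
    return f"{behaviour.action}: {APPROACH[diagnose(scores)]}"


statin = Behaviour("take evening statin", "patient", "at home after dinner",
                   "self", "daily, 8pm")
plan = select_approach(statin, {Barrier.CAPABILITY: 0.9,
                                Barrier.OPPORTUNITY: 0.8,
                                Barrier.MOTIVATION: 0.3})
```

Because the diagnosis is keyed to a specific behavior, the same person can have a motivation barrier for one medication and an opportunity barrier for exercise, and get a different approach for each.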

If you work in this field, the intended reaction is simple: these are the right primitives, applied seriously.

references — selected

Liu et al. (2024). Lost in the Middle: How Language Models Use Long Contexts. TACL 12.
Hsieh et al. (2024). RULER: What's the Real Context Size of Your Long-Context Language Models? (NVIDIA / COLM).
Modarressi et al. (2025). NoLiMa: Long-Context Evaluation Beyond Literal Matching. (ICML).
Anthropic (2025). Effective context engineering for AI agents.
Michie, van Stralen & West (2011). The behaviour change wheel. Implementation Science 6:42.
Presseau et al. (2019). Action, actor, context, target, time (AACTT): a framework for specifying behaviour. Implementation Science 14:102.
Ng et al. (2012). Self-Determination Theory applied to health contexts: a meta-analysis. Perspectives on Psychological Science 7(4).
Sheldon & Elliot (1999). The self-concordance model. JPSP 76(3).
Horne et al. (2013). The Necessity-Concerns Framework: a meta-analytic review. PLOS ONE 8(12).
Gadkari & McHorney (2012). Unintentional non-adherence: how unintentional is it really? BMC Health Services Research 12:98.
Greene, Hibbard et al. (2015). When patient activation levels change, health outcomes and costs change, too. Health Affairs 34(3).

Built for the long horizon.
