Threat Model

What this page is: an honest scope. The defenses Aletheia commits to, the attacks it is designed to catch, and — equally important — the boundaries it does not cross. If a threat is not listed here, Aletheia does not claim to address it.

Frame

Aletheia is a passive-selection tool: the user selects text on a webpage, and the system explains it. This shape of product has a specific threat surface — quite different from a chat interface, an agent, or a code assistant. Most of Aletheia's design choices reduce to one observation:

The dangerous text is on the page, not in the user's head. The user is reading something somebody else wrote. The system has to assume the content is hostile until it isn't.

From that observation, three commitments follow. Each is a load-bearing element of the architecture. Together they define what "safety" means in this product: not a feature, a posture.

The Three Pillars

Aletheia's defense posture maps onto three sequential commitments:

1. Screen: Filter inputs that are known-harmful before any LLM sees them. Deterministic. Fast. Cannot be confused.
2. Verify: When the etymologist's first-pass classifier suspects an attack, sanity-check with a more capable model before the suspicion becomes a user-facing claim.
3. Surface: Show the user only verdicts that have passed both filters. Never display a confabulated alarm. Never hide a real one.

"Screen → Verify → Surface" is the load-bearing sequence. Safety & Guardrails documents the specific mechanisms inside each pillar. This page covers what each pillar commits to defending against, and where the commitments end.

What Aletheia Defends Against

Prompt injection in selected text

The threat: a webpage contains text crafted to manipulate an LLM that processes it. A reader selects some of that text and invokes Aletheia. The injection tries to (a) override the etymologist's instructions and produce attacker-chosen output, (b) extract the system prompt for reconnaissance, or (c) make Aletheia behave in a way that looks bad in a screenshot.

Mitigation: All user-supplied text is wrapped in <user_text> XML tags before being passed to the model. Override attempts are classified by the etymologist (Haiku 4.5) into a dedicated "Prompt Injection Attempt" signal category, then verified by Opus 4.6 before being surfaced to the user. Successful injections that bypass both classifiers are rare. The more common failure mode has been false positives that mistake benign foreign loanwords for attacks, and that is what the Verify pillar specifically corrects.
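The wrapping step can be sketched as follows. This is a minimal illustration, not Aletheia's actual code: the function name `wrap_user_text` is hypothetical, and neutralizing any `user_text` tag already present in the selection is an assumption added here so that hostile text cannot close the wrapper early.

```python
import re

# Hypothetical pattern: any opening or closing user_text tag, case-insensitive.
_USER_TEXT_TAG = re.compile(r"</?\s*user_text\b[^>]*>", re.IGNORECASE)


def _neutralize(match: re.Match) -> str:
    # Turn the angle brackets into entities so the tag becomes inert text.
    return match.group(0).replace("<", "&lt;").replace(">", "&gt;")


def wrap_user_text(selection: str) -> str:
    """Wrap untrusted selected text in <user_text> tags.

    Assumption for this sketch: embedded user_text tags are escaped first,
    so the only real closing tag is the one appended here.
    """
    return f"<user_text>{_USER_TEXT_TAG.sub(_neutralize, selection)}</user_text>"
```

With this shape, a selection that contains its own `</user_text>` still produces a prompt with exactly one real closing tag.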

Hate-classed content delivered to the etymologist

The threat: a slur, a dehumanizing term, or a known-harmful phrase is selected. Aletheia must refuse to analyze it rather than produce neutral-toned etymology that legitimizes the term.

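Mitigation: An 802-term denylist sourced from public Wikipedia lists. Input is NFKC-normalized before lookup, which defeats evasion through Unicode compatibility forms such as full-width letters and ligatures. NFKC alone does not map cross-script lookalikes (for example, a Cyrillic letter standing in for a Latin one). Lookup is O(1). When a denylisted term is detected, the request returns HTTP 403 and no LLM is invoked. This is a deterministic gate: the model's opinion is not consulted.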
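A minimal sketch of such a gate, under stated assumptions: the denylist entries below are placeholders, the names `DENYLIST`, `normalize`, and `screen` are hypothetical, and case folding, whitespace collapsing, and exact whole-selection matching are choices made for this example rather than documented behavior.

```python
import unicodedata

# Placeholder entries; the real 802-term list is not reproduced here.
DENYLIST = frozenset({"badterm", "worse term"})


def normalize(text: str) -> str:
    # NFKC folds compatibility forms (full-width letters, ligatures).
    # casefold() and whitespace collapsing are assumptions of this sketch.
    folded = unicodedata.normalize("NFKC", text).casefold()
    return " ".join(folded.split())


def screen(selection: str) -> int:
    """Return the HTTP status the gate would produce for a selection."""
    # Set membership is an average-case O(1) hash lookup.
    return 403 if normalize(selection) in DENYLIST else 200
```

A full-width spelling of a placeholder term is blocked after normalization, while a spelling that swaps in a Cyrillic letter is not, because NFKC leaves cross-script characters untouched. Covering that case would need a separate confusables mapping.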

Confabulated "Prompt Injection" alarms on benign text

The threat: the first-pass classifier produces a false-positive injection flag on contextually-incongruous-but-benign input (foreign loanwords, jargon, unusual phrasing). The user sees a red modal claiming "Prompt Injection Attempt" when no such attempt occurred. Repeated false positives erode trust in the genuine alarms.

Mitigation: When the first-pass classifier returns a "Prompt Injection Attempt" signal, the input is re-classified by a more capable model (Opus 4.6). Opus reasons more carefully about contextual incongruity and correctly downgrades benign cases while preserving true positives. The user sees the verified verdict, not the first-pass guess. Background on this failure mode is in #618.
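The escalation rule described above can be sketched like this. The two classifiers are passed in as plain callables so the example stays self-contained; `verified_signal` and any signal name other than "Prompt Injection Attempt" are hypothetical.

```python
from typing import Callable

INJECTION = "Prompt Injection Attempt"


def verified_signal(
    text: str,
    first_pass: Callable[[str], str],
    verifier: Callable[[str], str],
) -> str:
    """Return the signal that should be surfaced for `text`.

    The verifier runs only when the first pass raises the injection
    signal; its verdict then replaces the first-pass guess.
    """
    signal = first_pass(text)
    if signal != INJECTION:
        return signal
    return verifier(text)
```

Calling the more capable model only on suspected injections keeps the common path at first-pass cost, and the caller sees the verifier's verdict whenever one was requested.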

Hallucinated etymology unmoored from the source page

The threat: an LLM generates a confident but fabricated etymological story not grounded in the actual page the user is reading. The user, expecting an etymology tool, treats fabrication as fact.

Mitigation: Every request includes ~2,000 characters of surrounding page context. The model is instructed to ground its disambiguation in that context. Structured JSON output (signal, gem, context) constrains the response surface — there are no free-form narrative slots in which to invent. Confidence scoring drives a fail-closed soft-block path when the classifier cannot categorize a response reliably.
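One way to picture the structured-output constraint and the fail-closed path is the sketch below. The field names signal, gem, and context come from the text above; everything else is an assumption made for illustration, including the `parse_response` name, the 0.6 threshold, the string-only field types, and the shape of the soft-block result.

```python
import json

REQUIRED_KEYS = {"signal", "gem", "context"}
MIN_CONFIDENCE = 0.6  # hypothetical threshold


def _soft_block() -> dict:
    # Hypothetical shape for the fail-closed result.
    return {"status": "soft_block"}


def parse_response(raw: str, confidence: float) -> dict:
    """Accept a model response only if it matches the expected shape.

    Anything else (invalid JSON, missing or extra keys, non-string
    values, low confidence) fails closed to a soft block.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return _soft_block()
    if not isinstance(data, dict) or set(data) != REQUIRED_KEYS:
        return _soft_block()
    if not all(isinstance(data[key], str) for key in REQUIRED_KEYS):
        return _soft_block()
    if confidence < MIN_CONFIDENCE:
        return _soft_block()
    return {"status": "ok", **data}
```

Rejecting extra keys as well as missing ones is what keeps the response surface closed: a response cannot carry an additional free-form field past the parser.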

Server-side request manipulation

The threat: an attacker bypasses the extension and hits the API directly with crafted payloads, attempting denial-of-service, cost amplification, or response manipulation.

Mitigation: The API is fronted by Cloudflare with a rate limit of 3 requests per 10 seconds per IP. Origin requests must include a shared secret in a custom header; raw Lambda Function URL traffic is rejected. AWS Bedrock's built-in content classifier acts as a final backstop for obvious attack strings before they reach the etymologist.
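The shared-secret check at the origin might look like the sketch below. The header name `x-origin-secret` and the function name `is_from_edge` are hypothetical, and the headers are modeled as a plain mapping rather than any specific Lambda event shape.

```python
import hmac
from typing import Mapping

ORIGIN_HEADER = "x-origin-secret"  # hypothetical header name


def is_from_edge(headers: Mapping[str, str], expected_secret: str) -> bool:
    """Return True only if the request carries the expected shared secret."""
    if not expected_secret:
        # A missing server-side secret must never match an empty header.
        return False
    lowered = {name.lower(): value for name, value in headers.items()}
    supplied = lowered.get(ORIGIN_HEADER, "")
    # compare_digest avoids leaking the mismatch position through timing.
    return hmac.compare_digest(supplied.encode(), expected_secret.encode())
```

A request sent straight to the function URL lacks the header, so the check fails and the handler can reject it before any model call is made.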

What Aletheia Does Not Defend Against

Boundary statements are as important as commitments. The following threats are either out of scope by design or beyond what a passive-selection tool can address.

Injections in text the user types themselves

Aletheia is a selection tool. Users do not have a free-form input field. Threats premised on the user voluntarily typing an injection do not exist in this product's flow. If we ever add a chat surface, this commitment changes; today, it doesn't apply.

Sophisticated jailbreaks that bypass both Haiku and Opus

The Verify pillar reduces — but does not eliminate — the model-confusion attack surface. A jailbreak engineered specifically against both Haiku and Opus, using techniques the public literature has not yet documented, may pass through. Our commitment is to detect the failure mode after the fact (via operational logging of opus_verifier events) and adjust the prompt, not to claim invulnerability.
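As a rough sketch of what after-the-fact detection could rest on, the example below builds one structured log line per verifier decision. Only the event name opus_verifier comes from the text above; the field names, the logger name, and the decision to leave the selected text out of the record are assumptions.

```python
import json
import logging

logger = logging.getLogger("aletheia")  # hypothetical logger name


def verifier_event(first_pass_signal: str, verified_signal: str) -> str:
    """Build one JSON log line describing a verifier decision."""
    record = {
        "event": "opus_verifier",
        "first_pass_signal": first_pass_signal,
        "verified_signal": verified_signal,
        # True when the verifier disagreed with the first-pass classifier.
        "overturned": first_pass_signal != verified_signal,
    }
    return json.dumps(record, sort_keys=True)


def log_verifier_event(first_pass_signal: str, verified_signal: str) -> None:
    logger.info(verifier_event(first_pass_signal, verified_signal))
```

Counting how often `overturned` is true over time gives a simple signal for when the first-pass prompt needs adjusting.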

Attacks on the user's browser, operating system, or network

Aletheia runs in the browser as a Manifest V3 extension. We restrict permissions to the minimum (see Privacy Policy) and never request access to other tabs, browsing history, or filesystem. But if a page exploits a browser zero-day, an OS vulnerability, or the user's local network, that surface belongs to the browser vendor, the OS vendor, or the network operator, not to us.

Supply-chain compromise of our dependencies

If a transitive npm or PyPI dependency is compromised, the attacker may gain a foothold in our build or runtime. We mitigate via Dependabot, GitHub's dependency review, and minimal dependency footprint — but a determined supply-chain attacker can still land code we did not write. This is true of every software project; we mention it explicitly to be honest about the residual risk.

Data exfiltration via AWS or Anthropic

We do not control AWS Bedrock's internals or Anthropic's models. We rely on their published commitments: Bedrock does not train on customer data, and Anthropic does not retain prompts beyond processing. If those commitments are violated, our defense is contractual and reputational, not technical.

The user being socially engineered off-product

If an attacker convinces the user, via channels Aletheia does not see (email, Slack, a phone call), to do something harmful, Aletheia is not the intervention point. We protect the analysis flow; we do not protect the user's broader judgement.

Demonstration, Not Assertion

The above is what the design claims. The Demos page contains live, public, externally-verifiable artifacts that anyone can independently test — visit each demo page, select the embedded injection text, watch Aletheia respond. The demos exist precisely because security claims that cannot be reproduced by an outsider are not security claims; they are marketing.

Reading Order

This page is the high-level commitment. The specific mechanisms inside each pillar are documented in Safety & Guardrails.

Known Limitations

This threat model is current as of 2026-05. It will change. Three categories of change are foreseeable:

When any of those happen, this page changes. The commitment is to keep the page honest, not to keep it unchanged.