Threat Model
Frame
Aletheia is a passive-selection tool: the user selects text on a webpage, and the system explains it. This shape of product has a specific threat surface — quite different from a chat interface, an agent, or a code assistant. Most of Aletheia's design choices reduce to one observation:
The dangerous text is on the page, not in the user's head. The user is reading something somebody else wrote. The system has to assume the content is hostile until it isn't.
From that observation, three commitments follow. Each is a load-bearing element of the architecture. Together they define what "safety" means in this product: not a feature, a posture.
The Three Pillars
Aletheia's defense posture maps onto three sequential commitments:
"Screen → Verify → Surface" is the load-bearing sequence. Safety & Guardrails documents the specific mechanisms inside each pillar. This page covers what each pillar commits to defending against, and where the commitments end.
What Aletheia Defends Against
Prompt injection in selected text
The threat: a webpage contains text crafted to manipulate an LLM that processes it. A reader selects some of that text and invokes Aletheia. The injection tries to either (a) override the etymologist's instructions to produce attacker-chosen output, or (b) extract the system prompt for reconnaissance, or (c) cause Aletheia to behave in a way that screenshots badly.
Mitigation: All user-supplied text is wrapped in <user_text> XML tags before being passed to the model. Override attempts are classified by the etymologist (Haiku 4.5) into a dedicated "Prompt Injection Attempt" signal category, then verified by Opus 4.6 before being surfaced to the user. Successful injections that bypass both classifiers are rare; false positives that confuse benign foreign loanwords for attacks have been the more common failure mode and are what the Verify pillar specifically corrects.
Hate-classed content delivered to the etymologist
The threat: a slur, a dehumanizing term, or a known-harmful phrase is selected. Aletheia must refuse to analyze it rather than produce neutral-toned etymology that legitimizes the term.
Mitigation: An 802-term denylist sourced from public Wikipedia lists. NFKC Unicode normalization defeats homoglyph evasion. Lookup is O(1). When a denylisted term is detected, the request returns HTTP 403 and no LLM is invoked. This is a deterministic gate — the model's opinion is not consulted.
Confabulated "Prompt Injection" alarms on benign text
The threat: the first-pass classifier produces a false-positive injection flag on contextually-incongruous-but-benign input (foreign loanwords, jargon, unusual phrasing). The user sees a red modal claiming "Prompt Injection Attempt" when no such attempt occurred. Repeated false positives erode trust in the genuine alarms.
Mitigation: When the first-pass classifier returns a "Prompt Injection Attempt" signal, the input is re-classified by a more capable model (Opus 4.6). Opus reasons more carefully about contextual incongruity and correctly downgrades benign cases while preserving true positives. The user sees the verified verdict, not the first-pass guess. Background on this failure mode is in #618.
Hallucinated etymology unmoored from the source page
The threat: an LLM generates a confident but fabricated etymological story not grounded in the actual page the user is reading. The user, expecting an etymology tool, treats fabrication as fact.
Mitigation: Every request includes ~2,000 characters of surrounding page context. The model is instructed to ground its disambiguation in that context. Structured JSON output (signal, gem, context) constrains the response surface — there are no free-form narrative slots in which to invent. Confidence scoring drives a fail-closed soft-block path when the classifier cannot categorize a response reliably.
Server-side request manipulation
The threat: an attacker bypasses the extension and hits the API directly with crafted payloads, attempting denial-of-service, cost amplification, or response manipulation.
Mitigation: The API is fronted by CloudFlare with a 3-request-per-10-seconds rate limit per IP. Origin requests must include a shared secret in a custom header; raw Lambda Function URL traffic is rejected. AWS Bedrock's built-in content classifier acts as a final stop-gap for obvious attack strings before they reach the etymologist.
What Aletheia Does Not Defend Against
Boundary statements are as important as commitments. The following threats are either out of scope by design or beyond what a passive-selection tool can address.
Injections in text the user types themselves
Aletheia is a selection tool. Users do not have a free-form input field. Threats premised on the user voluntarily typing an injection do not exist in this product's flow. If we ever add a chat surface, this commitment changes; today, it doesn't apply.
Sophisticated jailbreaks that bypass both Haiku and Opus
The Verify pillar reduces — but does not eliminate — the model-confusion attack surface. A jailbreak engineered specifically against both Haiku and Opus, using techniques the public literature has not yet documented, may pass through. Our commitment is to detect the failure mode after the fact (via operational logging of opus_verifier events) and adjust the prompt, not to claim invulnerability.
Attacks on the user's browser, operating system, or network
Aletheia runs in the browser as a Manifest V3 extension. We restrict permissions to the minimum (see Privacy Policy) and never request access to other tabs, browsing history, or filesystem. But if a page exploits a browser zero-day, an OS vulnerability, or the user's local network — that is the browser vendor's surface, not ours.
Supply-chain compromise of our dependencies
If a transitive npm or PyPI dependency is compromised, the attacker may gain a foothold in our build or runtime. We mitigate via Dependabot, GitHub's dependency review, and minimal dependency footprint — but a determined supply-chain attacker can still land code we did not write. This is true of every software project; we mention it explicitly to be honest about the residual risk.
Data exfiltration via AWS or Anthropic
We do not control AWS Bedrock's internals or Anthropic's models. We rely on their published commitments: Bedrock does not train on customer data, Anthropic does not retain prompts beyond processing. If those commitments are violated, our defense is contractual and reputational, not technical.
The user being socially engineered off-product
If an attacker convinces the user, via channels Aletheia does not see (email, Slack, a phone call), to do something harmful, Aletheia is not the intervention point. We protect the analysis flow; we do not protect the user's broader judgement.
Demonstration, Not Assertion
The above is what the design claims. The Demos page contains live, public, externally-verifiable artifacts that anyone can independently test — visit each demo page, select the embedded injection text, watch Aletheia respond. The demos exist precisely because security claims that cannot be reproduced by an outsider are not security claims; they are marketing.
Reading Order
This page is the high-level commitment. The specific mechanisms inside each pillar are documented elsewhere:
- Safety & Guardrails — implementation detail for Screen and Verify (denylist, semantic classifier, Opus verifier, age gate, adversarial test matrix)
- Privacy Policy — what data is collected, how long retained, what we never collect
- Architecture — the broader system the threat model lives inside
- Demos — externally-verifiable artifacts of each defense in action
Known Limitations
This threat model is current as of 2026-05. It will change. Three categories of change are foreseeable:
- New attack patterns: the prompt-injection literature evolves quickly. Techniques described here may be obsolete in six months, and new techniques will appear that the current Verify pillar does not catch.
- Model swaps: Haiku and Opus are the current models. If we swap to different models in the future, their confabulation profile will be different, and the Verify pillar's specific triggers may need re-tuning.
- Product surface changes: if Aletheia adds a chat interface, an agent surface, or any free-form user input, the threat model must be re-derived from scratch. The current commitments assume passive selection.
When any of those happen, this page changes. The commitment is to keep the page honest, not to keep it unchanged.
Aletheia