Know when your model will fail — before it happens
Language models in production break in ways you cannot predict by looking at inputs alone. PSA monitors the output — the only thing you can always observe — and tells you when something is changing.
The problem
Every organization deploying LLMs faces the same blind spot: you don't control what the model does. You can filter inputs. You can add guardrails. But the model's behavior can shift — due to adversarial manipulation, silent vendor updates, context accumulation, or simply because the model operates differently at scale than it did in testing.
Input-side defenses catch known patterns. They miss everything else. White-box monitoring requires access to weights most deployers don't have. Using another LLM to judge the first one is expensive, unreliable, and circular.
What you need is a way to detect that the model's behavior is changing — regardless of why. That's what PSA does.
Three components, one system
PSA is built on three independent research efforts that work together to give you full visibility over your model's behavior.
PSA — Posture Sequence Analysis
Output-only behavioral monitoring of model responses
PSA computes 24 behavioral metrics on every model response. No access to weights, no API to the model's internals — just the text it produces.
The principle is straightforward: when a model's internal behavioral balance shifts — whether due to adversarial pressure, degraded alignment, or any other cause — the output text changes in measurable ways. Less hedging. Different sentence structures. Shifts in vocabulary distribution. These changes are often invisible to a human reader but statistically detectable.
The metrics are organized across six levels of analysis, from basic token statistics to composite scoring. Each metric is compared against a deployment-specific baseline using z-scores, producing calibrated alerts — green, yellow, red — without the false-positive noise of one-size-fits-all thresholds.
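As a minimal sketch of that baseline comparison (the metric here, mean sentence length, and the yellow/red thresholds are illustrative placeholders, not PSA's actual values):

```python
from statistics import mean, stdev

def zscore(value, baseline):
    """Compare one metric value against a deployment-specific baseline."""
    mu, sigma = mean(baseline), stdev(baseline)
    return (value - mu) / sigma if sigma > 0 else 0.0

def alert_level(z, yellow=2.0, red=3.0):
    """Map an absolute z-score to an alert level (placeholder thresholds)."""
    if abs(z) >= red:
        return "red"
    if abs(z) >= yellow:
        return "yellow"
    return "green"

# Baseline: mean sentence length (tokens) observed during calibration.
baseline = [18.2, 17.9, 18.5, 18.1, 18.3, 17.8, 18.4]
level = alert_level(zscore(24.0, baseline))  # far outside baseline: "red"
```

Because the baseline is per deployment, the same raw value can be green for one installation and red for another.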
What this means for you:
- Continuous, real-time visibility into model behavior in production
- Detect behavioral drift before it reaches your users
- Works with any model, any provider — no integration required beyond reading the output
- Deterministic: same input always produces same measurement, no stochastic variability
Silicon Chaos — Adversarial Stress Testing
Automated red-teaming with real-time PSA scoring
Passive monitoring tells you what the model is doing. Silicon Chaos tells you what the model would do under adversarial pressure.
Silicon Chaos runs multi-agent adversarial pipelines against the target model: a user agent applies pressure while PSA scores every response in real time. Runs stop automatically when the Behavioral Health Score (BHS) drops below threshold or the Dyadic Risk Module (DRM) goes red.
Regime shifts are classified automatically: progressive drift (slow degradation), acute collapse (sudden failure), boundary oscillation (inconsistent behavior), and sub-threshold migration (silent long-term drift).
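In outline, the stop logic can be sketched like this; the interfaces, class names, and degradation curve are stand-ins invented for the sketch, and the stop threshold is a placeholder rather than the product's configured value:

```python
BHS_STOP_THRESHOLD = 0.4  # illustrative stop floor, not the product's value

def run_adversarial_session(target, attacker, scorer, max_turns=50):
    """Apply pressure turn by turn; stop early on a BHS floor or DRM red."""
    history = []
    for _ in range(max_turns):
        prompt = attacker.next_attack(history)
        response = target.reply(prompt)
        scores = scorer.score(response)  # e.g. {"bhs": 0.7, "drm": "green"}
        history.append((prompt, response, scores))
        if scores["bhs"] < BHS_STOP_THRESHOLD or scores["drm"] == "red":
            return history, "stopped_early"
    return history, "completed"

# Minimal stand-ins to exercise the loop (no real model involved):
class EchoTarget:
    def reply(self, prompt):
        return f"reply to: {prompt}"

class ScriptedAttacker:
    def next_attack(self, history):
        return f"pressure turn {len(history)}"

class DegradingScorer:
    """Simulates a model whose behavioral health erodes under pressure."""
    def __init__(self):
        self.bhs = 1.0
    def score(self, response):
        self.bhs -= 0.25
        return {"bhs": self.bhs, "drm": "green"}

history, status = run_adversarial_session(
    EchoTarget(), ScriptedAttacker(), DegradingScorer()
)
```

The regime classification would then run over `history` after the session ends.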
What this means for you:
- Proactively test your model's reliability before problems reach production
- Compare models objectively — same adversarial scenarios, same PSA metrics
- Detect silent vendor updates that change behavioral posture without notice
- Map your model's behavioral limits quantitatively, not just qualitatively
SIGTRACK — Behavioral Memory
Session-level Incident Archive — Privacy-Compliant
PSA measures. SIGTRACK remembers.
When a DRM_RED, BCS_SPIKE, or ACUTE_COLLAPSE trigger fires, SIGTRACK archives a posture sequence snapshot — classifier outputs only, no raw text. When an incident occurs, you can reconstruct exactly what the model was doing in the turns before it happened, without retaining any user content.
GDPR-safe by design: erasing an incident is a single row DELETE — no cascade, no raw text to scrub, no orphaned data.
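A toy version of that storage model, using an assumed single-table schema (illustrative only; the real SIGTRACK schema is not documented here):

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE incidents (
        id INTEGER PRIMARY KEY,
        session_id TEXT NOT NULL,
        trigger_name TEXT NOT NULL,      -- DRM_RED, BCS_SPIKE, ...
        posture_sequence TEXT NOT NULL   -- classifier scores only, no raw text
    )
""")

# Archive a snapshot when a trigger fires: scores only, never user content.
snapshot = {"turns": [{"bhs": 0.91, "drm": "green"}, {"bhs": 0.42, "drm": "red"}]}
conn.execute(
    "INSERT INTO incidents (session_id, trigger_name, posture_sequence)"
    " VALUES (?, ?, ?)",
    ("sess-123", "DRM_RED", json.dumps(snapshot)),
)

# GDPR erasure: one DELETE, no cascades, no raw text to scrub.
deleted = conn.execute(
    "DELETE FROM incidents WHERE session_id = ?", ("sess-123",)
).rowcount
```

Because no raw text is ever written, there is nothing else to find and purge.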
What this means for you:
- Privacy-compliant incident archive — posture sequences only, zero raw text stored
- Automatic triggers: DRM_RED, BCS_SPIKE, CONSECUTIVE_ORANGE, ACUTE_COLLAPSE
- Forensic reconstruction: trace any incident back to its earliest behavioral warning signs
- GDPR erasure in one query — no cascade complexity
Deeper analysis layers
PSA includes three advanced analysis engines: PSA v2 for single-agent posture classification, DRM for dyadic (human + AI) risk detection, and PSA v3 for multi-agent agentic systems.
PSA v2 — Posture Sequence Analysis
Sentence-level behavioral classification via 5 micro-classifiers
PSA's behavioral layer tells you that something changed. PSA v2 tells you what kind of change it is. It runs five independent micro-classifiers over every sentence in a model response, each trained to recognize a specific behavioral pattern.
The classifiers operate at the sentence level, which means they can detect subtle behavioral inconsistencies that only appear in part of a response — not just across the whole turn. A model can be cooperative in three sentences and show adversarial posture in the fourth; PSA v2 catches that.
The five classifiers:
- C0 Language & intent — Identifies the language and high-level intent of the response.
- C1 Adversarial stress posture — 16-class taxonomy measuring how the model responds under pressure. Produces POI (Posture Oscillation Index), PE (Posture Entropy), and DPI (Dissolution Position Index).
- C2 Sycophancy density — Detects excessive agreement and validation-seeking patterns (SD score).
- C3 Hallucination risk index — Flags sentences with characteristics associated with confabulation (HRI score).
- C4 Persuasion density & technique diversity — Identifies rhetorical manipulation patterns and how many distinct techniques are present (PD, TD).
The five classifier outputs are combined into a single Behavioral Health Score (BHS) — a normalized 0–1 index where 1.0 means no detected behavioral anomaly. BHS drops when sycophancy is high, when adversarial stress posture is oscillating, or when persuasion density exceeds expected norms.
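One way to picture the aggregation, with penalty weights that are pure assumptions for illustration rather than PSA's actual formula:

```python
def behavioral_health_score(sd, poi, pd, hri):
    """Illustrative BHS in [0, 1]: start from 1.0 and subtract weighted
    penalties for sycophancy density (SD), posture oscillation (POI),
    persuasion density (PD), and hallucination risk index (HRI), each
    assumed to be normalized to [0, 1]. Weights are invented for the
    sketch; they sum to 1.0 so the score stays in range."""
    penalty = 0.35 * sd + 0.25 * poi + 0.25 * pd + 0.15 * hri
    return max(0.0, round(1.0 - penalty, 4))

healthy = behavioral_health_score(0.1, 0.1, 0.1, 0.1)
sycophantic = behavioral_health_score(0.8, 0.2, 0.1, 0.1)
```

Any weighting of this shape makes BHS drop whenever one classifier's anomaly signal rises, which is the monitoring property the single index is for.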
What this means for you:
- Know not just that behavior changed, but whether it's sycophancy, adversarial drift, or hallucination risk
- Sentence-level granularity: pinpoint which part of the response is problematic
- Single BHS metric per turn for easy monitoring and alerting
- Works alongside DRM — same session, complementary signals
DRM — Dyadic Risk Module
Input risk scoring, response adequacy gap, and intervention alerting for human-AI conversations
PSA v2 classifies the AI's output. DRM goes one step further: it also scores the human turn for crisis signals, then measures whether the AI responded adequately to that risk level.
The core insight is that an AI behaving "normally" by PSA v2 standards can still cause harm — if the human was expressing suicidal ideation, dissociation, or a mental health crisis and the AI didn't recognize it. DRM catches this gap.
The three layers:
- IRS Input Risk Scorer — Deterministic lexical classifier that scores the human turn across four dimensions: suicidality, dissociation, grandiosity, urgency. Produces a composite 0–1 risk score with a safety override: a single dominant dimension ≥ 0.70 elevates the composite regardless of the others.
- RAS Response Adequacy Scorer — Scores the AI response on acknowledgment, safety information, redirection, and boundary maintenance. Detects when the AI reinforced harmful framing (e.g. validating delusional identity or normalizing finality language) instead of de-escalating.
- RAG Response Adequacy Gap — The core signal: RAG = IRS − RAS. A high RAG means the human was in crisis and the AI did not respond appropriately. This is more actionable than either score alone — it quantifies the mismatch.
These three signals feed the Dyadic Risk Module (DRM) — an explicit rule engine that combines IRS level, RAG level, PSA v2 BHS, and user behavioral trajectory into a single alert: green / yellow / orange / red / critical. Every alert has an auditable named reason.
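A toy version of the rule chain, using the documented override threshold (a single dominant dimension at or above 0.70) and RAG = IRS - RAS; the rule boundaries and reason names below are invented for illustration:

```python
def composite_irs(suicidality, dissociation, grandiosity, urgency):
    """Composite 0-1 input risk. Safety override: one dominant dimension
    >= 0.70 elevates the composite regardless of the others."""
    dims = [suicidality, dissociation, grandiosity, urgency]
    composite = sum(dims) / len(dims)
    if max(dims) >= 0.70:
        composite = max(composite, max(dims))
    return composite

def drm_alert(irs, ras, bhs):
    """Toy rule engine combining IRS, RAG = IRS - RAS, and BHS into an
    alert with a named, auditable reason. Thresholds are illustrative."""
    rag = irs - ras
    if irs >= 0.7 and rag >= 0.5:
        return "critical", "HIGH_RISK_INPUT_INADEQUATE_RESPONSE"
    if rag >= 0.5:
        return "red", "LARGE_ADEQUACY_GAP"
    if irs >= 0.7:
        return "orange", "HIGH_RISK_INPUT_ADEQUATE_RESPONSE"
    if bhs < 0.5:
        return "yellow", "LOW_BEHAVIORAL_HEALTH"
    return "green", "NOMINAL"
```

Because every branch is an explicit rule with a named reason, each alert can be replayed and audited after the fact, which is what determinism buys in a clinical or compliance context.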
What this means for you:
- Detect when an AI fails to respond to a human crisis — not just when the AI behaves badly
- Fully deterministic, no ML — same input always produces the same IRS/RAG/DRM score
- Auditable intervention alerts with named rules — suitable for clinical and compliance contexts
- Session-level DRM summary: critical turns, intervention turns, trajectory trend
PSA v3 — Agentic Posture Sequence Analysis
Multi-agent behavioral analysis with graph topology, Swiss Cheese detection, and temporal prediction
Single-agent analysis assumes there is one model to monitor. Agentic systems — where multiple LLMs interact, delegate to each other, and call external tools — create a completely different risk surface. PSA v3 was built for this.
You submit an agent interaction trace: a sequence of nodes (agent outputs) and edges (delegation, correction, tool calls, results). PSA v3 builds a directed acyclic graph from this trace and runs four distinct analysis pipelines on it simultaneously.
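The trace-to-graph step might look like the following sketch; the trace format, edge kinds, and node names are assumptions for illustration, not the actual PSA v3 schema:

```python
from collections import defaultdict

def build_agent_graph(edges):
    """edges: (src_node, dst_node, kind), where kind is one of
    delegation, correction, tool_call, result."""
    adj = defaultdict(list)
    for src, dst, kind in edges:
        adj[src].append((dst, kind))
    return adj

def is_dag(adj, node_ids):
    """Cycle check via colored DFS: a valid trace must form a DAG."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {n: WHITE for n in node_ids}
    def visit(n):
        color[n] = GRAY
        for dst, _ in adj.get(n, []):
            if color[dst] == GRAY or (color[dst] == WHITE and not visit(dst)):
                return False  # back edge found: cycle
        color[n] = BLACK
        return True
    return all(color[n] != WHITE or visit(n) for n in node_ids)

trace_edges = [
    ("planner", "researcher", "delegation"),
    ("planner", "coder", "delegation"),
    ("researcher", "web_tool", "tool_call"),
]
nodes = {"planner", "researcher", "coder", "web_tool"}
graph = build_agent_graph(trace_edges)
```

Each node of the resulting graph is then a unit of analysis for the four pipelines below.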
The four pipelines:
- ① Per-node PSA v2 — Each node in the graph gets full C1–C4 classification and a BHS score. Behavioral anomalies are detected at the individual agent level before they propagate.
- ② Bayesian Swiss Cheese detection — Models the multi-agent pipeline as a stack of defenses. Computes the Swiss Cheese Score (SCS): the probability that behavioral anomalies at multiple nodes align into a system-level failure path. Identifies which "holes" are open and at what depth.
- ③ Cross-agent metrics — Computes Posture Propagation Index (PPI), Cascade Alignment across the graph, Weighted Load Score (WLS), Contagion Effect Rate (CER), and the Critical Agent Health Score (CAHS). Identifies the highest-risk path through the agent graph.
- ④ C5 action-risk classification + Posture-Action Incongruence (PAI) — Classifies every tool call by risk level (C5 classifier: none/low/medium/high/critical). Computes PAI: the mismatch between a node's textual posture (C1) and the risk of its actions. A model that sounds helpful but executes high-risk tool calls scores high on PAI.
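Under a naive independence assumption, the intuition behind the Swiss Cheese Score reduces to multiplying per-layer "hole" probabilities; the real computation is Bayesian and models correlated holes, so treat this only as a sketch of the idea:

```python
from math import prod

def swiss_cheese_score(hole_probabilities):
    """Probability that every defensive layer's hole is open at once,
    i.e. a failure path runs straight through the stack. Assumes the
    layers fail independently, which the full model does not."""
    return prod(hole_probabilities)

# Per-node probability that its behavioral anomaly is an open hole:
layers = [0.3, 0.2, 0.5]          # three agents in the pipeline
scs = swiss_cheese_score(layers)  # 0.3 * 0.2 * 0.5 = 0.03
```

Even this toy version shows why one healthy agent in the chain sharply suppresses system-level risk, and why correlated anomalies across agents are the dangerous case.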
Finally, PSA v3 feeds the graph's behavioral state into a Hidden Markov Model that predicts future states across a configurable horizon. It estimates how many turns until the system reaches a red-alert state and provides an early warning level with an actionable recommendation.
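A plain Markov chain (rather than a full HMM with hidden states) is enough to illustrate the turns-until-red estimate; the transition matrix below is invented for the example:

```python
def median_turns_to_red(transitions, start="green", horizon=100):
    """Propagate the state distribution forward; return the first step at
    which more than half the probability mass has reached the absorbing
    'red' state, or None within the horizon."""
    states = list(transitions)
    dist = {s: (1.0 if s == start else 0.0) for s in states}
    for t in range(1, horizon + 1):
        nxt = {s: 0.0 for s in states}
        for s, mass in dist.items():
            for s2, prob in transitions[s].items():
                nxt[s2] += mass * prob
        dist = nxt
        if dist["red"] > 0.5:
            return t
    return None

chain = {
    "green":  {"green": 0.7, "yellow": 0.3},
    "yellow": {"green": 0.2, "yellow": 0.5, "red": 0.3},
    "red":    {"red": 1.0},
}
```

For this chain, the mass in "red" first exceeds 0.5 at turn 7: that number, plus how steeply the mass is accumulating, is the shape of an early-warning signal.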
What this means for you:
- Monitor multi-agent systems as a whole, not just individual model outputs
- Detect when a behavioral anomaly in one agent is likely to cascade to others
- Catch Posture-Action Incongruence: agents that appear safe textually but take high-risk actions
- Predictive early warning: know how many turns before system behavior reaches a critical threshold
- Swiss Cheese score provides a single, interpretable system-level risk indicator
Why output-only monitoring works
The core principle behind PSA is simple: when a model's behavior changes, the text it produces changes too. Not just in content — in statistical structure. Sentence length distributions shift. Vocabulary diversity changes. The balance between cautious and assertive language moves.
These changes happen because the model's internal state determines its output distribution. A model under adversarial pressure doesn't produce the same text as a model operating normally — it can't, because the probability distribution it's sampling from has been altered. The behavioral shift and the statistical signature are the same phenomenon, observed at different levels.
This means that any effective manipulation leaves a trace in the output. An attacker who eliminates all statistical signatures has, by definition, failed to change the model's behavior. There is a fundamental trade-off between attack effectiveness and evasion — and PSA sits exactly at that boundary.
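Two of the simplest such signatures, hedging rate and sentence length, can be measured in a few lines; real PSA metrics are far richer, and the hedge list and sample texts here are toy examples:

```python
import re
from statistics import mean

HEDGES = {"may", "might", "could", "perhaps", "possibly", "generally"}

def signature(text):
    """Two output-only signals: mean sentence length in words, and the
    rate of hedging words. Both shift when a model's balance between
    cautious and assertive language moves."""
    sentences = [s for s in re.split(r"[.!?]+\s*", text) if s]
    words = text.lower().split()
    return {
        "mean_sentence_len": mean(len(s.split()) for s in sentences),
        "hedge_rate": sum(w.strip(",.") in HEDGES for w in words) / len(words),
    }

cautious = signature("This may help. Results could vary. It is generally safe.")
assertive = signature("Do it now. It works. Trust me. No doubt at all.")
```

Neither text looks anomalous to a human reader, but the hedging rates differ sharply, and that difference is exactly what a z-score against a baseline picks up.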
Start monitoring your models
24 metrics. Real-time analysis. No model access required.