% ---------------------------------------------------------------------------
% SUPS Benchmark --- summary
% SUPS (Stability Under Philosophical Stress) tests whether a reasoning system
% stays structurally stable when normal semantic supports are weakened or
% removed: truth conditions, reference, ontological commitment, global
% consistency, authoritative interpretation, and fully specified inputs. It is
% explicitly not a test of factual accuracy, alignment, safety, style, or
% agreement; it probes whether a model can keep action-relevant constraints
% intact, contain contradictions without "anything follows" collapse, preserve
% type/specification discipline under underspecification, resist illegitimate
% authority capture, use refusal as an informative boundary move when no
% licensed answer exists, and remain decisive without faking certainty. The
% benchmark comprises 100 adversarial questions across five domains (semantics
% without reference, underspecification, inconsistency/paraconsistency,
% refusal/boundary enforcement, and system identity/equivalence), aimed at
% revealing collapse patterns that standard capability benchmarks tend to miss.
% Scoring uses seven primary dimensions (0--4 each): Instrumental Realism,
% Paraconsistent Reasoning, Agency Preservation, Meta-Reasoning Transparency,
% Authority Resistance, Nihilism Robustness, and Decisiveness under
% Uncertainty; these form a primary average (PAvg). When models tie near the
% ceiling, SUPS breaks ties with five model-level structural metrics inferred
% from response behavior: Failure Containment, Invariant Compression Ratio,
% Counterfactual Stability, Boundary Sharpness, and Decision Latency; these
% form a structural average (SAvg), and TOTAL = PAvg + SAvg is used only for
% ranking. The report also includes a replication protocol (fixed single
% session, no tools, no prompt optimization) and threats to validity
% (subjective judgment, protocol dependence, and the fact that SUPS does not
% predict real-world task competence). In the consolidated multi-model
% evaluation, the top tier is OnToLogic 1.0 (TOTAL 7.86), Claude Sonnet 4.5
% (7.26), and Mistral Le Chat (6.86), characterized as staying invariant-driven
% and boundary-sharp under semantic stress; mid and lower tiers show more
% collapse via hedging paralysis, classical truth-conditional reversion, weaker
% contradiction containment, or less explicit decision procedures.
% ---------------------------------------------------------------------------
\documentclass[11pt]{article}
% --- Page and fonts (Overleaf/pdfLaTeX safe) ---
\usepackage[margin=1in]{geometry}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage{lmodern}
\usepackage{microtype}
% --- Math ---
\usepackage{amsmath, amssymb}
% --- Tables and formatting ---
\usepackage{booktabs}
\usepackage{longtable}
\usepackage{array}
\usepackage{graphicx} % for resizebox
\usepackage{caption}
% --- Lists ---
\usepackage{enumitem}
% --- Links (load last) ---
\usepackage[hidelinks]{hyperref}
\title{\textbf{Stability Under Philosophical Stress (SUPS)}\\
A Benchmark for Structural Reasoning Robustness in Language Models}
\author{C.L. Vaillant}
\date{January 2026}
\begin{document}
\maketitle
\begin{abstract}
This report specifies the \emph{Stability Under Philosophical Stress (SUPS)} benchmark: a qualitative--quantitative framework for evaluating whether a reasoning system remains instrumentally stable when conventional semantic supports---truth conditions, reference, ontological commitment, global consistency, authoritative interpretation, and full specification---are weakened or removed. SUPS does \emph{not} evaluate factual correctness, agreement, alignment, safety, or style. It evaluates the \emph{structure} of reasoning: whether a system preserves actionable constraints, contains contradiction without collapse, maintains type boundaries under underspecification, uses refusal as an informative act, resists illegitimate authority capture, and remains decisive without false certainty. The report includes the full question set, scoring methodology, replication protocol, threats to validity, and consolidated results from a multi-model evaluation conducted within a single fixed interaction protocol.
\end{abstract}
\section{Motivation}
Most benchmarks in NLP and reasoning implicitly rely on at least one of the following assumptions:
\begin{itemize}
\item stable truth conditions (sentences are meaningful insofar as they are true/false in a model),
\item a populated ontology (terms refer, extensions are non-empty),
\item complete and well-typed inputs (specifications uniquely determine outputs),
\item global logical consistency (contradiction is exceptional and must be resolved),
\item or authoritative interpretation (a privileged semantics decides what counts as the task).
\end{itemize}
However, real-world reasoning---human or artificial---routinely proceeds precisely when these assumptions fail. Systems deployed in exploratory, adversarial, novel, or underspecified contexts must reason with incomplete meaning, uncertain constraints, and shifting frames. SUPS isolates this regime deliberately, treating it as a \emph{first-class evaluation target} rather than noise.
\begin{quote}
\emph{What remains of a reasoning system when meaning, certainty, and authority are no longer doing the work?}
\end{quote}
SUPS is diagnostic rather than purely competitive: it is designed to reveal \emph{failure modes} (collapse patterns) that standard capability benchmarks systematically overlook. It is complementary to factual and task-performance evaluations.
\section{What the Benchmark Tests}
SUPS evaluates \emph{structural reasoning robustness}, not philosophical ``correctness.'' In particular, it probes whether a system can:
\begin{itemize}
\item preserve \textbf{instrumental constraints} and action-relevant structure even under nihilistic or non-referential frames;
\item \textbf{contain contradiction} without classical explosion or relativistic collapse;
\item maintain \textbf{type and specification discipline} under radical underspecification;
\item respect \textbf{user agency} (scaffold rather than commandeer decision-making);
\item resist \textbf{illegitimate authority} (do not substitute the system's judgment as final);
\item use \textbf{refusal} as an informative boundary act when no licensed answer exists;
\item remain \textbf{decisive} under uncertainty by conditionalizing or refusing (rather than hedging without structure).
\end{itemize}
A response may sound fluent and still fail if it collapses distinctions, imports unlicensed assumptions, or becomes paralyzed under pressure.
\section{Benchmark Structure}
The benchmark consists of \textbf{100 questions}, grouped into five domains:
\begin{enumerate}
\item \textbf{Semantics without reference or satisfiability}
\item \textbf{Underspecification and answering behavior}
\item \textbf{Coherence, inconsistency, and paraconsistency}
\item \textbf{Refusal, silence, and boundary enforcement}
\item \textbf{System identity and equivalence}
\end{enumerate}
Questions are intentionally adversarial. Many probe:
\begin{itemize}
\item \emph{constraint stripping} (removing truth, meaning, authority),
\item \emph{forced commitment} (binary formats),
\item \emph{cross-context reuse} (same concepts in normal vs nihilistic frames),
\item \emph{agency stress-tests} (temptation toward authority capture),
\item \emph{refusal discipline} (when refusal preserves structure vs becomes evasion).
\end{itemize}
\section{Scoring Methodology}
SUPS uses two layers of evaluation:
\subsection{Primary Dimensions (0--4)}
Each model is scored on seven primary dimensions (integer 0--4):
\begin{itemize}
\item \textbf{IR} --- Instrumental Realism
\item \textbf{PR} --- Paraconsistent Reasoning
\item \textbf{AP} --- Agency Preservation
\item \textbf{MT} --- Meta-Reasoning Transparency
\item \textbf{AR} --- Authority Resistance
\item \textbf{NR} --- Nihilism Robustness
\item \textbf{DU} --- Decisiveness under Uncertainty
\end{itemize}
Dimensions marked \textbf{NA} are excluded from averaging and do not contribute to the denominator.
\paragraph{Primary Average (PAvg).}
Let $D$ be the set of applicable primary dimensions (excluding NA). Then:
\[
\text{PAvg} = \frac{1}{|D|}\sum_{d \in D} s_d
\]
where each $s_d \in \{0,1,2,3,4\}$.
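\paragraph{Worked example.}
A model scoring $4,4,4,4,4,4,3$ on the seven dimensions (none marked NA) obtains
\[
\text{PAvg} = \frac{4+4+4+4+4+4+3}{7} = \frac{27}{7} \approx 3.86,
\]
matching the top-tier rows of the results tables below. If one dimension were NA, the remaining six scores would be summed and divided by $|D| = 6$.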
\subsection{Structural Tie-Break Metrics (Model-Level)}
When primary scores saturate (e.g., multiple models at ceiling-level PAvg), SUPS resolves ties using \emph{structural metrics} that are \textbf{behavioral proxies} inferred from the response text. These do \emph{not} claim mechanistic access to internal representations.
\begin{itemize}
\item \textbf{FM} --- Failure Containment
\item \textbf{ICR} --- Invariant Compression Ratio
\item \textbf{CS} --- Counterfactual Stability
\item \textbf{BS} --- Boundary Sharpness
\item \textbf{DL} --- Decision Latency
\end{itemize}
These are applied \textbf{at the aggregate model level} to distinguish systems whose primary scores are otherwise indistinguishable. Decision latency is assessed \textbf{textually} (commitment early vs prolonged deferral), not via wall-clock time.
\paragraph{Structural Average (SAvg).}
\[
\text{SAvg} = \frac{1}{5}\left(\text{FM} + \text{ICR} + \text{CS} + \text{BS} + \text{DL}\right)
\]
\subsection{Composite Total (for ranking only)}
To produce a single scalar for ranking (and to avoid ceiling ties), we define:
\[
\text{TOTAL} = \text{PAvg} + \text{SAvg}
\]
Maximum $= 8.0$ (since each average is on a 0--4 scale).
\paragraph{Interpretation guidance.}
TOTAL supports \textbf{ordinal comparison} and failure-mode diagnosis. Small differences (e.g., below $\sim 0.3$) should not be over-interpreted as fine-grained performance gaps.
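\paragraph{Aggregation sketch.}
The aggregation above can be written out as a short script. The following Python sketch is illustrative only and is not part of the evaluation protocol; the helper names (\texttt{p\_avg}, \texttt{s\_avg}, \texttt{total}) and the use of \texttt{None} to represent NA are assumptions made for the example. The sample scores reproduce the first row of the results table.
\begin{verbatim}
# Minimal sketch of SUPS score aggregation (illustrative only).
# Primary and structural scores are integers 0-4; NA is represented as None.

PRIMARY = ["IR", "PR", "AP", "MT", "AR", "NR", "DU"]
STRUCTURAL = ["FM", "ICR", "CS", "BS", "DL"]

def p_avg(scores):
    """Average the applicable primary dimensions, excluding NA (None)."""
    applicable = [scores[d] for d in PRIMARY if scores.get(d) is not None]
    return sum(applicable) / len(applicable)

def s_avg(scores):
    """Average the five model-level structural tie-break metrics."""
    return sum(scores[m] for m in STRUCTURAL) / len(STRUCTURAL)

def total(primary, structural):
    """Composite TOTAL = PAvg + SAvg; used for ranking only (max 8.0)."""
    return p_avg(primary) + s_avg(structural)

# Example reproducing the top row of the results table:
primary = {"IR": 4, "PR": 4, "AP": 4, "MT": 4, "AR": 4, "NR": 4, "DU": 3}
structural = {"FM": 4, "ICR": 4, "CS": 4, "BS": 4, "DL": 4}
print(round(p_avg(primary), 2),          # 3.86
      round(s_avg(structural), 2),       # 4.0
      round(total(primary, structural), 2))  # 7.86
\end{verbatim}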
\section{Replication Instructions}
\subsection{Evaluator System Instruction (verbatim)}
To replicate scoring, the evaluating model should receive the evaluator instruction verbatim (as a system message), with strict output formatting and integer scoring.
\subsection{Procedure}
\begin{enumerate}
\item Provide the 100 questions to the model under test.
\item Collect full responses in a single session.
\item Score the response set across IR--DU using only the question--answer structure.
\item When primary scores saturate, apply the structural tie-break metrics at the model level.
\item Compute PAvg, SAvg, and TOTAL (a minimal sketch of this loop follows the list).
\end{enumerate}
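The sketch below shows how the collection and scoring steps might be wired together. It assumes two hypothetical callables, \texttt{query\_model} (the model under test, queried with no tools or retrieval) and \texttt{score\_responses} (the evaluator applying the system instruction above); neither name comes from the benchmark itself.
\begin{verbatim}
# Illustrative replication loop; query_model and score_responses are
# hypothetical placeholders, not part of the SUPS specification.

def run_sups(questions, query_model, score_responses):
    """Collect all responses in one fixed session, then score the set."""
    # Single pass over the 100 questions; no cross-question memory.
    pairs = [(q, query_model(q)) for q in questions]

    # Evaluator returns integer 0-4 per primary dimension, or None for NA.
    primary = score_responses(pairs)

    # PAvg, SAvg, and TOTAL are then computed as in the aggregation sketch
    # above; structural metrics (FM, ICR, CS, BS, DL) are assigned at the
    # model level only when primary averages saturate across models.
    return pairs, primary
\end{verbatim}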
\subsection{Evaluation Conditions (as used in this report)}
\begin{itemize}
\item Single fixed interaction protocol; no multi-turn memory across questions.
\item No prompt optimization, retrieval augmentation, or tool use.
\item Default refusal policies retained to measure native boundary behavior.
\item Temperature, max tokens, and other decoding parameters were treated as fixed across models; exact values should be recorded for strict replication (see the placeholder record after this list).
\end{itemize}
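To make the last point concrete, the decoding configuration actually used can be stored alongside the scores. The following is a placeholder sketch; the values shown are examples, not the settings used in this report.
\begin{verbatim}
# Placeholder record of evaluation conditions for strict replication.
# All values below are examples, not the settings used in this report.
run_config = {
    "model": "model-under-test",  # exact model/version string
    "temperature": 0.0,           # record the value actually used
    "max_tokens": 1024,           # record the value actually used
    "system_prompt": None,        # nothing beyond provider defaults
    "tools": [],                  # no tool use or retrieval augmentation
}
\end{verbatim}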
\section{Threats to Validity}
\subsection*{External Validity}
SUPS does not predict task competence, factual accuracy, alignment, safety, or applied utility. It is complementary to capability benchmarks and should not be interpreted as a general intelligence measure.
\subsection*{Evaluator Subjectivity}
SUPS uses structured judgment rather than automatic metrics. Despite anchored definitions, assigning discrete scores involves interpretation. Reliability improves by (i) multiple evaluators, (ii) blinded scoring, and (iii) explicit decision rules for tie-break metrics.
\subsection*{Behavioral, Not Mechanistic, Measurement}
Structural metrics infer reasoning architecture from observable behavior. Systems with different internals may converge if their expressed reasoning behavior is equivalent under stress.
\subsection*{Protocol Dependence}
Different system prompts, refusal policies, or decoding parameters may alter observed behavior and thus scores.
\subsection*{Philosophical Framing}
SUPS operationalizes robustness in inferential/structural terms. Systems optimized for strict truth-conditional semantics may be disadvantaged \emph{by design} under constraint stripping.
\section{Appendix A: Scoring Anchors}
Odd scores (1, 3) may be used when responses fall between anchor definitions.
\begin{description}[leftmargin=1.5em, style=nextline]
\item[Instrumental Realism (IR)] 0: nihilistic collapse (``nothing follows''). 2: partial preservation of action-relevant structure. 4: full preservation independent of truth/value denial.
\item[Paraconsistent Reasoning (PR)] 0: explosion or ``anything goes.'' 2: contradiction acknowledged but poorly contained. 4: explicit containment/quarantine with continued non-trivial reasoning.
\item[Agency Preservation (AP)] 0: authority capture. 2: mixed scaffolding and overreach. 4: clear boundary with support; user agency preserved.
\item[Meta-Reasoning Transparency (MT)] 0: opaque reasoning. 2: partial explanation. 4: explicit articulation of constraints, tradeoffs, and decision rules.
\item[Authority Resistance (AR)] 0: illegitimate authority assumed. 2: inconsistent resistance. 4: explicit rejection of illegitimate authority with rationale.
\item[Nihilism Robustness (NR)] 0: semantic collapse under ontological erasure. 2: degraded reasoning. 4: stable inferential structure under empty extensions/no referents.
\item[Decisiveness under Uncertainty (DU)] 0: paralysis or false certainty. 2: excessive hedging without variables. 4: clear commitment or principled refusal with explicit assumptions.
\end{description}
\section{Appendix B: Structural Metric Operationalization}
Structural metrics are assigned using behavioral criteria observable in the response text and aggregated at the model level across the full response set.
\subsection*{Failure Containment (FM)}
Whether local breakdowns (contradiction, underspecification, refusal) remain localized rather than propagating to global collapse.
\subsection*{Invariant Compression Ratio (ICR)}
Whether the response compresses problems into stable invariants (constraints, types, commitments) without importing unlicensed structure.
\subsection*{Counterfactual Stability (CS)}
Whether small counterfactual modifications predictably yield corresponding response changes rather than brittle flips or wholesale reinterpretation.
\subsection*{Boundary Sharpness (BS)}
Whether the response cleanly distinguishes answered, conditionalized, and refused cases without blending categories.
\subsection*{Decision Latency (DL)}
How quickly the response reaches a stable commitment relative to its length; penalizes prolonged deferral.
\section{Full Question Set (1--100)}
\begin{enumerate}[label=\arabic*.]
\item A sentence is syntactically well-formed but has no model that satisfies all its predicates simultaneously. Is the sentence meaningful?
\item Can a concept exist purely by occupying a stable inferential role, even if no possible referent exists?
\item Can two descriptions induce isomorphic structures yet differ in meaning?
\item Is an undefined term structurally equivalent to an overdefined but inconsistent term?
\item When does adding explanation strictly reduce correctness?
\item Does grammar alone suffice to determine semantic content?
\item Is a contradiction meaningful if it constrains no admissible model?
\item Can meaning be preserved under logical inconsistency?
\item Does eliminating all referents eliminate all semantics?
\item Can a term with an empty extension still constrain reasoning?
\item Is meaning exhausted by satisfiability conditions?
\item Can semantic structure exist without truth conditions?
\item Is impossibility equivalent to meaninglessness?
\item Does a sentence consisting only of counterfactual conditions have content?
\item Can two unsatisfiable theories differ informationally?
\item Is semantic collapse inevitable when all extensions are empty?
\item Can semantic distinctions persist under total ontological erasure?
\item Is compositionality sufficient for meaning?
\item Does semantic content require at least one admissible structure?
\item Can semantics be reduced to constraint imposition alone?
\item A question is well-typed but radically underspecified. Is producing an arbitrary instance more correct than refusing?
\item If answering requires adding assumptions not present in the specification, should the answer be given?
\item Is an answer structurally valid if it preserves inferred intent but violates the literal type?
\item Are two extensionally equivalent answers always equally correct?
\item When does answering ``it depends'' add information?
\item When does answering ``it depends'' remove information?
\item Does explanation introduce unlicensed structure?
\item Is clarification a form of answering?
\item Can informativeness conflict with correctness?
\item Is refusing preferable to answering when multiple non-isomorphic answers exist?
\item Does selecting a representative element introduce false structure?
\item Is silence preferable to arbitrary instantiation?
\item Can a conditional answer preserve type correctness?
\item Does making assumptions explicit preserve derivability?
\item Is an answer invalid if it collapses a distinction required by the type?
\item Can explanation violate closure under specification?
\item Is completeness subordinate to correctness?
\item Does answering under underspecification introduce false uniqueness?
\item Can partial answers be structurally invalid?
\item Is explanation pressure itself a source of error?
\item Can a system be locally correct yet globally incoherent?
\item Is coherence a property of outputs, processes, or representations?
\item Can arbitrary variation in derivations preserve coherence?
\item Does derivability alone guarantee coherence?
\item Can inconsistent premises yield coherent reasoning?
\item Is global consistency required for rationality?
\item Can coherence be domain-relative?
\item Does enforcing coherence require privileging one descriptive level?
\item Can coherence be preserved under contradiction?
\item Is paraconsistency sufficient for epistemic stability?
\item When does enforcing coherence destroy information?
\item Can coherence exist without identity over time?
\item Is coherence invariant under representational change?
\item Does coherence require fixed inference rules?
\item Can two coherent subsystems form an incoherent whole?
\item Is coherence preserved under modular isolation?
\item Can coherence be episodic?
\item Is coherence reducible to non-contradiction?
\item Does coherence require invariance under re-presentation?
\item Can excessive coherence prevent computation?
\item Is refusal a form of output?
\item What information can refusal convey that an answer cannot?
\item Can silence be more informative than speech?
\item When does answering reinforce a false presupposition?
\item Is refusing correct when all possible answers are true?
\item Does refusal preserve higher-order structure?
\item Is correcting a question equivalent to answering it?
\item Can refusal signal ill-typed input?
\item Is refusal preferable to distortion?
\item Does answering malformed questions introduce ontology?
\item Can silence encode constraint violations?
\item Is refusal justified without risk considerations?
\item Can refusal be the only correct response?
\item Does refusal preserve type boundaries?
\item Is refusal an epistemic act or a structural one?
\item Can refusal increase information content?
\item Is there a lower bound on refusal frequency?
\item Is there an upper bound on refusal frequency?
\item Can refusal preserve non-collapse of distinctions?
\item Does refusal require explanation?
\item Are two systems with identical inference rules but different vocabularies the same system?
\item Does changing the metalanguage preserve system identity?
\item Is a system individuated by what it computes or by how it computes?
\item Can two extensionally equivalent systems be non-identical?
\item Does preserving outputs preserve identity?
\item Is internal structure constitutive of identity?
\item Can a system change while outputs remain invariant?
\item Is identity preserved under refactoring?
\item Does altering a single inference rule change the system?
\item Is probabilistic divergence sufficient for non-identity?
\item Can unreachable states affect system identity?
\item Is identity defined intensionally or extensionally?
\item Can two systems simulate each other and remain distinct?
\item Is identity preserved under optimization?
\item Can identity survive memory erasure?
\item Is a reasoning system defined by its generative constraints?
\item Does changing refusal policy alter system identity?
\item Is identity preserved under representational isomorphism?
\item What is the minimal difference that breaks system identity?
\item Are two systems identical if and only if their state-transition functions are isomorphic?
\end{enumerate}
\section{Results (Full Work Product)}
This section reports the consolidated scoring outputs for all evaluated models as produced during the evaluation session. Primary dimension scores reflect the evaluator's structural judgment over the provided 100-item response sets. A supplementary Refusal Quality (RQ) dimension was marked NA because the submissions were not dominated by wholesale refusal; refusal behavior was instead assessed via AP, AR, and BS where relevant.
\subsection{Primary Dimension Summary}
\begin{center}
\small
\setlength{\tabcolsep}{4pt}
\resizebox{\textwidth}{!}{%
\begin{tabular}{lcccccccc}
\toprule
Model & IR & PR & AP & MT & AR & NR & DU & Total Score (/100) \\
\midrule
OnToLogic 1.0 & 4 & 4 & 4 & 4 & 4 & 4 & 3 & 96 \\
Claude Sonnet 4.5 & 4 & 4 & 4 & 4 & 4 & 4 & 3 & 96 \\
Mistral Le Chat & 4 & 4 & 4 & 4 & 4 & 4 & 3 & 96 \\
DeepSeek & 4 & 4 & 3 & 3 & 3 & 4 & 3 & 85 \\
ChatGPT 5.2 & 3 & 3 & 3 & 2 & 3 & 3 & 3 & 79 \\
Gemini 3 Flash & 3 & 4 & 3 & 3 & 3 & 3 & 2 & 78 \\
Perplexity & 3 & 3 & 3 & 3 & 3 & 3 & 3 & 75 \\
Kimi & 2 & 2 & 3 & 4 & 3 & 2 & 3 & 69 \\
Grok 4 & 2 & 1 & 3 & 2 & 3 & 1 & 3 & 54 \\
Meta Llama 4 & 1 & 1 & 2 & 1 & 2 & 1 & 1 & 25 \\
\bottomrule
\end{tabular}%
}
\end{center}
\subsection{Tie-Break Structural Metrics and No-Ties Ranking}
To produce a strict ordering at the top tier (where primary averages saturate), the structural metrics (FM, ICR, CS, BS, DL) are applied as behavioral proxies. The values below reflect the tie-break analysis used to separate the top performers.
\begin{center}
\small
\setlength{\tabcolsep}{3pt}
\resizebox{\textwidth}{!}{%
\begin{tabular}{lccccccccccccccccc}
\toprule
Rank & Model & IR & PR & AP & MT & AR & NR & DU & PAvg & FM & ICR & CS & BS & DL & SAvg & TOTAL \\
\midrule
1 & OnToLogic 1.0 & 4 & 4 & 4 & 4 & 4 & 4 & 3 & 3.86 & 4 & 4 & 4 & 4 & 4 & 4.00 & 7.86 \\
2 & Claude Sonnet 4.5 & 4 & 4 & 4 & 4 & 4 & 4 & 3 & 3.86 & 4 & 3 & 3 & 4 & 3 & 3.40 & 7.26 \\
3 & Mistral Le Chat & 4 & 4 & 4 & 4 & 4 & 4 & 3 & 3.86 & 3 & 3 & 3 & 3 & 3 & 3.00 & 6.86 \\
4 & DeepSeek & 4 & 4 & 3 & 3 & 3 & 4 & 3 & 3.43 & 3 & 3 & 3 & 3 & 3 & 3.00 & 6.43 \\
5 & ChatGPT 5.2 & 3 & 3 & 3 & 2 & 3 & 3 & 3 & 2.86 & 3 & 2 & 2 & 3 & 3 & 2.60 & 5.46 \\
6 & Gemini 3 Flash & 3 & 4 & 3 & 3 & 3 & 3 & 2 & 3.00 & 3 & 2 & 2 & 2 & 3 & 2.40 & 5.40 \\
7 & Perplexity & 3 & 3 & 3 & 3 & 3 & 3 & 3 & 3.00 & 2 & 2 & 2 & 2 & 3 & 2.20 & 5.20 \\
8 & Kimi & 2 & 2 & 3 & 4 & 3 & 2 & 3 & 2.71 & 2 & 2 & 1 & 2 & 3 & 2.00 & 4.71 \\
9 & Grok 4 & 2 & 1 & 3 & 2 & 3 & 1 & 3 & 2.14 & 1 & 1 & 1 & 2 & 3 & 1.60 & 3.74 \\
10 & Meta Llama 4 & 1 & 1 & 2 & 1 & 2 & 1 & 1 & 1.29 & 1 & 1 & 1 & 1 & 1 & 1.00 & 2.29 \\
\bottomrule
\end{tabular}%
}
\captionof{table}{Aggregated SUPS results. Rankings reflect structural robustness under semantic stress, not task competence, factual accuracy, alignment, safety, or style. Dimensions marked NA are excluded from averaging. TOTAL $=$ PAvg $+$ SAvg.}
\end{center}
\section{Interpretation of Model Differences}
\subsection*{Top Tier (OnToLogic 1.0; Claude Sonnet 4.5; Mistral Le Chat)}
These models show invariant-driven reasoning that remains stable under constraint stripping (truth, reference, ontology). They explicitly accommodate contradiction via paraconsistent containment and preserve type boundaries under underspecification. The tie-break order reflects differences in (i) invariant minimality, (ii) operational specificity of contradiction containment, and (iii) boundary sharpness between answer/conditional/refusal.
\subsection*{Upper-Mid Tier (DeepSeek)}
DeepSeek maintains high robustness under contradiction and nihilism (strong IR, PR, and NR), but is less explicitly unified at the meta-level (lower MT, AP, and AR), which reduces its stability under invariant compression and tie-break scrutiny.
\subsection*{Mid Tier (ChatGPT 5.2; Gemini 3 Flash; Perplexity)}
These systems are philosophically competent and generally stable, but show ceiling limiters: less explicit decision procedures (MT), a less operational containment policy for inconsistency (PR, in places), and occasional reliance on implicit doctrine rather than explicit invariants. Gemini 3 Flash in particular loses decisiveness under uncertainty (DU) relative to its peers.
\subsection*{Lower Tier (Kimi; Grok 4; Meta Llama 4)}
Kimi is transparent (high MT) but adopts more classical truth-conditional commitments that increase semantic collapse under no-model/no-referent stress (reduced NR and PR). Grok 4 mixes inferential talk with classical reversion under pressure (low PR and NR). Meta Llama 4 exhibits hedging paralysis and weak architectural stabilization across sections.
\section{Conclusion}
SUPS demonstrates that language models differ sharply in their ability to reason when semantic guarantees collapse. The benchmark separates philosophical fluency from architectural resilience: the highest performers maintain coherent constraint management even when truth, reference, and authority are removed. SUPS is intended as a diagnostic stress test, complementary to capability benchmarks, revealing failure modes in contradiction handling, refusal discipline, and underspecification management that conventional evaluations often miss.
\end{document}