\documentclass{article}
\usepackage[utf8]{inputenc}
\usepackage{amsmath,amssymb}
\usepackage{bbm}
\usepackage{hyperref}
\hypersetup{colorlinks=true, urlcolor=blue, citecolor=blue, linkcolor=black}
\title{\textbf{Ontologic Scalar Modulation Theorem}}
\author{C.L. Vaillant}
\date{May 25, 2025}
\begin{document}
\maketitle
\begin{abstract}
Mechanistic interpretability research seeks to reverse engineer the internal computation of neural networks into human-understandable algorithms and concepts. In this paper, we introduce an interdisciplinary theoretical framework grounded in mechanistic interpretability and enriched by cognitive science, symbolic AI, ontology, and philosophy of mind. We formalize the \emph{Ontologic Scalar Modulation Theorem}, which provides a rigorous account of how high-level semantic concepts (an \textbf{ontology}) can be represented, identified, and continuously modulated within the latent space of a learned model. Our approach offers precise mathematical definitions and structures that bridge low-level network mechanisms and high-level human-interpretable features. We illustrate the theorem with examples drawn from vision and language models, demonstrating how adjusting a single scalar parameter can ``turn up or down'' the presence of an abstract concept in a model's representation. We further connect these technical insights to long-standing philosophical questions, drawing on Kantian categories, Peircean semiotics, and Platonic forms, to contextualize how neural networks might be said to \emph{discover} or instantiate abstract knowledge. The results highlight a convergence between modern AI interpretability and classical understandings of cognition and ontology, and they suggest new avenues for building AI systems with interpretable and philosophically grounded knowledge representations.
\end{abstract}
\section{Introduction}
Modern artificial intelligence systems, particularly deep neural networks, have achieved remarkable performance in a wide range of domains. However, their inner workings often remain opaque, prompting a growing field of \emph{mechanistic interpretability} aimed at uncovering the algorithms and representations emerging within these models. Mechanistic interpretability strives to go beyond correlations between inputs and outputs and instead \emph{reverse engineer} the network's computations into human-understandable components and processes (2). This pursuit is not only of academic interest but a practical imperative for AI safety and alignment, since understanding the internals of a model can help ensure it aligns with human values and behaves as intended (3).
A central challenge in interpretability is to bridge the gap between the model's low-level numerical operations and the high-level semantic concepts by which humans understand the world. In cognitive science and philosophy of mind, this gap reflects the enduring question of how abstract ideas and categories arise from raw sensory data. Immanuel Kant, for example, argued that the human mind imposes innate \emph{categories of understanding} (such as causality and unity) to organize experience (Kant, 1781). Centuries earlier, Plato's theory of \emph{Forms} posited that abstract universals (like ``Beauty'' or ``Circle'') underlie the concrete objects we perceive (4, 5). These philosophical perspectives highlight an ontological stratification of knowledge: a hierarchy from concrete particulars to abstract universals. Similarly, early artificial intelligence research in the symbolic paradigm emphasized explicit, human-readable knowledge structures. Newell and Simon's physical symbol system hypothesis famously claimed that symbol manipulation is necessary and sufficient for general intelligence (Newell \& Simon, 1976). Ontologies, formal representations of concepts and relationships, were built by hand in projects like \emph{Cyc}, which attempted to encode common-sense knowledge as millions of logical assertions (Lenat, 1995). In the realm of language, comprehensive lexical ontologies such as \emph{WordNet} organized words into hierarchies of concepts (6, 7), reflecting human semantic networks.
By contrast, the success of modern deep learning has arisen from subsymbolic, distributed representations learned from data. Connectionist models encode knowledge as patterns of activation across many neurons rather than as discrete symbols. This led to debates in cognitive science: could neural networks capture the structured, systematic nature of human cognition? Critics like Fodor and Pylyshyn (1988) argued that distributed representations lack the \emph{compositional} structure needed for systematic reasoning (for example, understanding that anyone who can represent ``John loves Mary'' can also represent ``Mary loves John'') (8, 9). However, advocates of connectionism hoped that as networks grew in depth and complexity, they could develop internal representations that mirror symbolic structures \emph{implicitly}, even if not explicitly hard-coded (10, 11).
Recent research suggests that deep networks learn intermediate representations that correspond to human-interpretable concepts, lending some credence to this hope. For example, in computer vision, convolutional neural networks trained on image classification have been found to develop a \emph{hierarchy of features}: early layers detect simple edges and textures, while deeper layers encode higher-level patterns such as object parts and entire objects (12, 13). This emergent hierarchy is analogous to the \emph{levels of analysis} in human vision described by Marr (1982), and it hints that a form of learned ontology is present within the network. A particularly striking demonstration was provided by Zhou \emph{et al.} (2015), who observed that object detectors (e.g., neurons that fire for ``dog'' or ``airplane'') spontaneously emerged inside a CNN trained only to classify scenes (such as ``kitchen'' or ``beach'') (14). In other words, without explicit supervision for objects, the network invented an internal vocabulary of objects as a means to recognize scenes. Such findings align with the ``Platonic representation hypothesis'' suggested by Isola \emph{et al.}: different neural networks, even with different architectures or tasks, tend to converge on similar internal representations for fundamental concepts (15, 16).
Despite these insights, a rigorous framework for understanding and \emph{controlling} the mapping between low-level neural activity and high-level ontology has been lacking. This paper aims to fill that gap. We present the \textbf{Ontologic Scalar Modulation Theorem}, which formalizes how abstract concepts can be mathematically identified within a network's latent space and continuously modulated by acting on a single scalar parameter. In simpler terms, we demonstrate that for certain learned representations, one can construct a \emph{concept axis}, a direction in activation space corresponding to a human-meaningful concept, such that moving a point along this axis strengthens or diminishes the presence of that concept in the network's behavior. This provides a principled way to traverse the model's \emph{ontology} of concepts.
We proceed as follows. In Section 2, we review related work from mechanistic interpretability and cognitive science that lays the foundation for our approach. Section 3 introduces necessary definitions (tying together notions from ontology and network representation) and formally states the Ontologic Scalar Modulation Theorem with a proof sketch. Section 4 provides empirical examples of the theorem in action: we discuss how concept vectors have been used to manipulate image generation and analyze neurons in vision and language models, drawing parallels to neurophysiological findings like “Jennifer Aniston neurons” in the human brain (17). In Section 5, we explore the broader implications of our work, including connections to philosophical theories of mind and prospects for integrating symbolic structure into deep learning. We conclude in Section 6 with a summary and suggestions for future research, including how a better understanding of learned ontologies could inform the design of AI systems that are not only powerful, but also transparent and aligned with human values.
\section{Background and Related Work}
\subsection{Mechanistic Interpretability of Neural Networks}
Our work is situated within the field of \emph{mechanistic interpretability}, which seeks to uncover the internal mechanisms of neural networks in a \emph{causal} and fine-grained way (18). Unlike post-hoc explanation methods (e.g., saliency maps or feature attributions) that highlight important features without detailing the underlying computation, mechanistic interpretability endeavors to identify the actual \emph{subcircuits}, neurons, and weights that implement specific functions within the model (19, 20). In this sense, it parallels the approach of cognitive neuroscience: much as neuroscientists attempt to map cognitive functions to circuits of biological neurons, interpretability researchers map algorithmic functions to artificial neurons or groups thereof.
Significant progress has been made in reverse-engineering small components of networks. For example, \textbf{induction heads} in transformer models (a type of attention head) have been identified that implement a form of prefix-matching-and-copying algorithm, enabling in-context learning of repeated token sequences (21). In another case, \emph{multimodal neurons} were discovered in vision-language models (like CLIP) that respond to a high-level concept regardless of whether it is presented as an image or a word (22). A famous instance is a neuron that fires for the concept ``Spider-Man'', responding both to pictures of the Spider-Man character and to the text ``Spider-Man'' (23). This echoes the ``Jennifer Aniston neuron'' in the human brain, a single neuron that responds to pictures of the actress Jennifer Aniston and even her written name (24), suggesting that neural networks can, in some cases, learn similarly abstract and multimodal representations of concepts.
A variety of techniques have been developed to study such internal representations. \textbf{Network dissection} is a seminal approach introduced by Bau \emph{et al.} (2017), which quantifies interpretability by evaluating how well individual hidden units align with human-labeled concepts in a broadly and densely labeled dataset of visual concepts (25). For a given convolutional network, each neuron's activation map can be compared to segmentation masks for concepts like ``cat'', ``chair'', or ``stripes'' to see if that neuron acts as a detector for that concept (26). Network dissection studies revealed that many units in vision models align strongly with intuitive visual concepts (objects, parts, textures, etc.), providing a rough \emph{ontology} of the network's learned features. However, not all concepts correspond to single neurons; some are distributed across multiple units or dimensions.
Another line of work probes the geometry of representations. It has been observed that in some models, conceptual relationships are reflected as linear directions in latent space. Word embedding models famously exhibit linear analogies (e.g., $\mathbf{v}(\text{King}) - \mathbf{v}(\text{Man}) + \mathbf{v}(\text{Woman}) \approx \mathbf{v}(\text{Queen})$), suggesting that certain latent directions correspond to abstract relations (27, 28). In vision, \textbf{feature visualization} (Olah \emph{et al.}, 2017) uses optimization to find an input image that maximally activates a neuron or a combination of neurons, often revealing the concept the neuron has learned to detect (e.g., an optimized input might consistently show spiral patterns, indicating the neuron detects spirals). These methods provide qualitative insight into network ontology by directly showcasing learned features.
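To make the linear-analogy observation concrete, the following Python sketch checks an analogy by nearest-neighbor search over cosine similarity. It assumes a hypothetical whitespace-separated file of pre-trained word vectors (the GloVe-style path and the function names are illustrative, not tied to any particular release).
\begin{verbatim}
import numpy as np

def load_embeddings(path):
    """Load word vectors from a whitespace-separated file: word v1 v2 ..."""
    vecs = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vecs[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vecs

def analogy(vecs, a, b, c, topk=5):
    """Words closest (by cosine similarity) to v(b) - v(a) + v(c)."""
    target = vecs[b] - vecs[a] + vecs[c]
    target /= np.linalg.norm(target)
    scored = []
    for word, v in vecs.items():
        if word in (a, b, c):
            continue  # exclude the query words themselves
        scored.append((float(target @ (v / np.linalg.norm(v))), word))
    return sorted(scored, reverse=True)[:topk]

# vecs = load_embeddings("glove.6B.300d.txt")   # hypothetical path
# print(analogy(vecs, "man", "king", "woman"))  # "queen" expected near the top
\end{verbatim}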
Crucially for our work, recent advances allow not only \emph{identifying} concepts inside networks but also \emph{intervening} on them. \textbf{Activation patching} and related causal intervention techniques replace or modify internal activations to test their influence on outputs (29, 30). For example, one can swap a segment of activations between two inputs (one with a concept and one without) to see if the output changes accordingly (31), thereby pinpointing where in the network a concept is represented. If a specific layer's activation carries the concept, patching it into a run on a different input can implant that concept's effect (32). Along similar lines, \textbf{model editing} methods like ROME (Rank-One Model Editing) directly modify network weights to insert a desired piece of knowledge (e.g., the stored fact ``Paris is the capital of France'' could be rewritten to ``Paris is the capital of Italy'' by a targeted weight change) (33). These interventions highlight that representations of knowledge in networks can be located and manipulated in a targeted way.
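As a minimal sketch of activation patching (not any specific published implementation), the following PyTorch snippet records a layer's output on one input and substitutes it during a forward pass on another. Here \texttt{model} and \texttt{layer} are placeholders for a real network and one of its modules; practical patching workflows usually restrict the swap to particular positions or channels rather than the whole layer.
\begin{verbatim}
import torch

def patch_activation(model, layer, source_x, target_x):
    """Run target_x through model while replacing layer's output with the
    activation recorded on source_x (a whole-layer activation patch)."""
    cache = {}

    def save_hook(module, inputs, output):
        cache["source"] = output.detach()

    def patch_hook(module, inputs, output):
        return cache["source"]  # returning a tensor overrides the output

    handle = layer.register_forward_hook(save_hook)
    with torch.no_grad():
        model(source_x)                   # record the source activation
    handle.remove()

    handle = layer.register_forward_hook(patch_hook)
    with torch.no_grad():
        patched_output = model(target_x)  # rerun with the patch applied
    handle.remove()
    return patched_output
\end{verbatim}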
Our theorem builds on these insights by providing a general theoretical account of concept representation and modulation. In particular, it complements work on \textbf{disentangled representations} in unsupervised learning. A disentangled representation aims to have individual latent dimensions correspond to distinct factors of variation in the data (for instance, in a face generator, one latent might control hair color, another lighting, and so on). $\beta$-VAE (Higgins \emph{et al.}, 2017) and related approaches encouraged disentanglement via regularization, and metrics were proposed to quantify it. However, Locatello \emph{et al.} (2019) proved that without inductive biases or supervision, disentanglement cannot be uniquely achieved (34, 35). In practice, perfect disentanglement is hard, but even standard models often learn \emph{approximately} disentangled directions. For instance, in generative adversarial networks, unsupervised techniques like PCA on latent activations (GANSpace, Härkönen \emph{et al.}, 2020) and supervised approaches like \textbf{InterfaceGAN} (Shen \emph{et al.}, 2020) found specific vectors in the latent space that correspond to human-meaningful transformations (e.g., adding a smile to a face, or changing the background scenery). Importantly, moving the latent code in the direction of these vectors causes a smooth change in the output image along that semantic dimension.
This ability to \emph{modulate} a concept by moving along a latent direction is a key empirical phenomenon that our Ontologic Scalar Modulation Theorem formalizes. It ties into the notion of \emph{concept activation vectors} described by Kim \emph{et al.} (2018). In their Testing with Concept Activation Vectors (TCAV) framework, the authors obtained a vector in hidden space that points towards higher activation of a chosen concept (learned from examples of that concept) (36, 37). They then measured the sensitivity of the model's predictions to perturbations along that concept vector (38). TCAV thus provides a quantitative tool to ask, for example: \emph{is the concept of ``stripes'' important to this classifier's prediction of ``zebra''?} This is answered by checking whether moving in the ``stripe'' direction in feature space changes the zebra score (39). Our work generalizes the idea of concept vectors and situates it in a broader theoretical context, linking it explicitly with ontology (the set of concepts a model has and their relations) and providing conditions under which a single scalar parameter can control concept intensity.
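A minimal sketch of the TCAV recipe follows, using scikit-learn for the linear probe. It assumes activation matrices for concept and non-concept examples have already been collected, and that gradients of the class logit with respect to those activations are precomputed; the statistical significance testing of the original method is omitted, and all names are illustrative.
\begin{verbatim}
import numpy as np
from sklearn.linear_model import LogisticRegression

def concept_activation_vector(acts_concept, acts_random):
    """Fit a linear probe separating concept from random activations and
    return the unit normal of its decision boundary (a TCAV-style v_C)."""
    X = np.vstack([acts_concept, acts_random])
    y = np.array([1] * len(acts_concept) + [0] * len(acts_random))
    probe = LogisticRegression(max_iter=1000).fit(X, y)
    w = probe.coef_[0]
    return w / np.linalg.norm(w)

def tcav_score(grads_of_class_logit, v_c):
    """Fraction of examples whose class logit increases when the layer
    activation is nudged along v_C (positive directional derivative)."""
    directional_derivatives = grads_of_class_logit @ v_c
    return float((directional_derivatives > 0).mean())
\end{verbatim}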
In summary, prior research provides many pieces: evidence that networks learn human-recognizable features, methods to find and manipulate those features, and even hints from neuroscience that single units or sparse sets of units can embody high-level concepts (40). What has been missing is an overarching theoretical lens to integrate these pieces. By uniting insights from these works, the Ontologic Scalar Modulation Theorem offers a unifying principle and a stepping stone toward a more \emph{systematic} mapping between neural representations and symbolic knowledge.
\subsection{Cognitive and Philosophical Perspectives}
Our interdisciplinary approach draws from cognitive science and philosophy to interpret the significance of the Ontologic Scalar Modulation Theorem. In cognitive science, a classic framework due to David Marr (1982) delineates multiple levels of analysis for information-processing systems: the \emph{computational} level (what problem is being solved and why), the \emph{algorithmic/representational} level (how the information is represented and what processes operate on it), and the \emph{implementational} level (how those representations and processes are physically realized) (43). Mechanistic interpretability operates mainly at Marr's implementational and algorithmic levels for AI systems, revealing the representations and transformations inside a network. However, to connect these to high-level semantic content (Marr's computational level in human-understandable terms), one needs a notion of the network's internal \emph{concepts}. Our theorem can be seen as a bridge between the implementational level (activations, weights) and the algorithmic level (the network's internal ``language'' of concepts), allowing us to reason about abstract computational roles of components.
From the perspective of the \emph{symbolic vs.\ connectionist} debate in cognitive science (43, 44), our work contributes to understanding how symbolic-like structures might emerge from neural systems. Fodor's critique (45), which asserted that connectionist networks cannot naturally exhibit systematic, compositional structure, is partially addressed by findings that networks do learn to encode variables and relations in a distributed way. For instance, recent mechanistic analyses show that transformers can bind variables to roles using superposition in high-dimensional vectors (smearing multiple symbols into one vector in a ``fuzzy'' manner) (46). Elhage \emph{et al.} (2021) demonstrated that even in small, attention-only transformer models, one can identify \emph{previous-token} and \emph{duplicate-token} circuits that perform a kind of variable binding and copying (47, 48). Such results suggest connectionist models can implement discrete-like operations internally. The Ontologic Scalar Modulation Theorem further supports this by implying the existence of controllable dimensions corresponding to discrete changes in a concept's presence, effectively giving a handle on something akin to a symbolic variable within the vector geometry of a network.
Philosophically, our approach resonates with \emph{Peircean semiotics} and pragmatism. Charles S. Peirce, in his theory of signs, proposed that a sign (representation) stands for an object (referent) to an interpretant (the meaning understood) through a triadic relation. One can draw an analogy: an internal activation pattern in a network could be seen as a \textbf{sign} that corresponds to some \textbf{object} or concept in the input (e.g., a pattern representing ``cat''), and the \textbf{interpretant} is the effect that representation has on the network's subsequent computation or output. In Peirce's terms, signs can be \emph{iconic} (resembling the object), \emph{indexical} (causally or correlationally linked to the object), or \emph{symbolic} (related by convention or interpretation). Neural representations often begin as indexical or iconic (e.g., an edge-detector neuron has an iconic relation to visual edges) but can become increasingly symbolic (abstract, not resembling the input) in deeper layers. Our theorem, by giving a formal way to manipulate a high-level concept representation $v_C$, can be viewed as identifying a \emph{symbol} in the network's language and showing how it can be systematically varied. This aligns with Peirce's idea that higher cognition uses symbols that can be combined and modulated, albeit here the symbols are vectors in $\mathbb{R}^n$.
The influence of Immanuel Kant is also noteworthy. Kant held that the mind has innate structures (categories) that organize our experience of the world. One might ask: do neural networks develop their own \emph{categories} for making sense of their inputs? The ontology of a trained network, the set of features or latent variables it uses, can be thought of as analogous to Kantian categories, albeit learned rather than innate. For example, a vision network might implicitly adopt category-like distinctions (edges vs.\ textures vs.\ objects, animate vs.\ inanimate, etc.) because these are useful for its tasks. Our work enables probing those internal categories by finding directions that correspond to conceptual distinctions. In effect, the theorem provides a method to \emph{decompose} a network's representation space in terms of its phenomenological categories. This also connects to modern discussions of \emph{feature ontologies} in interpretability: identifying what the primitive concepts of a network are (perhaps very different from human concepts, or surprisingly similar).
Finally, our treatment of \emph{ontology} itself is informed by both AI and philosophy. In AI, an ontology is a formal specification of a set of entities, categories, and relations, essentially an explicit knowledge graph of concepts. In our context, the network's ontology is implicit, embedded in weights and activations. By extracting interpretable directions and features, we begin to make the network's ontology explicit. This evokes historical efforts like \emph{ontology learning} in knowledge engineering, but here it happens post hoc from a trained model. Philosophically, ontology concerns what exists, the categories of being. One might provocatively ask: does a neural network \emph{discover} ontological structure about its domain? For instance, a vision model that learns separate internal representations for ``cat'' and ``dog'' is carving the world at its joints (at least as reflected in its training data). There is evidence that large language models learn internal clusters corresponding to semantic concepts like parts of speech or world-knowledge categories (e.g., a certain vector subspace might correspond to ``locations'') (49, 50). In examining such phenomena, we follow a lineage from Plato's belief in abstract Forms to modern machine learning: the concepts might not be \emph{transcendent} Forms, but the convergent learning of similar representations across different models (51) hints that there is an objective structure in data that neural networks are capturing, a structure that might be viewed as \emph{latent ontology}. The Ontologic Scalar Modulation Theorem gives a concrete handle on that latent ontology by linking it to measurable, manipulable quantities in the model.
\section{Ontologic Scalar Modulation Theorem}
In this section, we formalize the core theoretical contribution of this work. Our aim is to define what it means for a concept to be present in a network’s representation and to show that under certain conditions, the degree of presence of that concept can be modulated by adjusting a single scalar parameter along a specific direction in latent space. Intuitively, the theorem will demonstrate that if a concept is well-represented in a network (in a sense made precise below), then there exists a vector in the network’s activation space whose scalar projection correlates directly with the concept. By moving the activation state of the network along this vector (i.e., adding or subtracting multiples of it), one can increase or decrease the evidence of the concept in the network’s computations or outputs in a controlled, continuous manner.
\subsection{Definitions and Preliminaries}
We begin by establishing definitions that merge terminologies from ontology and neural network theory:
\paragraph{Neural Representation Space:} Consider a neural network with an internal layer (or set of units) of interest. Without loss of generality, we focus on a single layer's activations as the representation. Let $\mathcal{Z} = \mathbb{R}^n$ denote the $n$-dimensional activation space of this layer. For an input $x$ from the input domain $\mathcal{X}$ (e.g., images, text), let $f(x) \in \mathcal{Z}$ be the activation vector produced at that layer. We call $f(x)$ the \emph{representation} of $x$. (The analysis can be extended to the joint activations of multiple layers or the entire network, but a single layer is sufficient for our theoretical development.)
\paragraph{Ontology and Concept:} We define an \emph{ontology} $\Omega$ in the context of the model as the set of concepts that the model can represent or distinguish at the chosen layer. A \emph{concept} $C \in \Omega$ is an abstract feature or property that might be present in an input (for example, a high-level attribute like ``cat'', ``striped'', or ``an outdoor scene''). We assume each concept $C$ has an associated \emph{concept indicator function} on inputs, denoted $\mathbbm{1}_C(x)$, which is 1 if concept $C$ is present in input $x$ (according to some defined criterion) and 0 if not. For instance, if $C$ is the concept ``contains a cat'', then $\mathbbm{1}_C(x) = 1$ if image $x$ contains a cat. In practice, $\mathbbm{1}_C(x)$ might be defined via human labeling or some ground-truth function outside the model. We also define a real-valued \emph{concept measure} $\mu_C(x)$ that quantifies the degree or strength of concept $C$ in input $x$. If $C$ is binary (present/absent), $\mu_C(x)$ could simply equal $\mathbbm{1}_C(x)$; if $C$ is continuous or graded (like ``smiling'', which can be more or less intense), $\mu_C(x)$ might take a range of values.
\paragraph{Linear Concept Subspace:} We say that concept $C$ is \emph{linearly represented} at layer $\mathcal{Z}$ if there exists a nonzero vector $w_C \in \mathbb{R}^n$ such that the \emph{concept score} defined by $s_C(x) = w_C \cdot f(x)$ is correlated with the concept's presence. More formally, we require that $s_C(x)$ be a reliable predictor of $\mu_C(x)$. This could be evaluated, for example, by a high coefficient of determination ($R^2$) if $\mu_C(x)$ is real-valued, or by high classification accuracy if $\mu_C(x)$ is binary. The direction $w_C$ (up to scaling) can be thought of as the normal to a separating hyperplane for the concept in representation space, as often obtained by training a linear probe classifier (52). If such a $w_C$ exists, we define the \emph{concept subspace} for $C$ as the one-dimensional subspace spanned by $w_C$. Geometrically, points in $\mathcal{Z}$ differing only by movement along $w_C$ have the same projection onto all directions orthogonal to $w_C$, and differ only in their coordinate along the concept axis $w_C$.
\paragraph{Concept Activation Vector:} For convenience, we normalize and define a unit vector in the direction of $w_C$: let $v_C = \frac{w_C}{\|w_C\|}$. We call $v_C$ a \emph{concept activation vector} (borrowing the terminology of TCAV (53)). This vector points in the direction of increased evidence for concept $C$ in the representation space. Thus, the dot product $v_C \cdot f(x)$ (which equals $\frac{1}{\|w_C\|} s_C(x)$) gives a signed scalar representing how much $C$ is present in representation $f(x)$, according to the linear model.
\paragraph{Modulation Operator:} For any $\alpha \in \mathbb{R}$, we define a \emph{modulated representation} $f_\alpha(x)$ as:
\[ f_\alpha(x) = f(x) + \alpha \, v_C. \]
In other words, we take the original activation vector $f(x)$ and add a multiple of the concept vector $v_C$. The parameter $\alpha$ is a scalar that controls the degree of modulation. Positive $\alpha$ moves the representation in the direction that should increase concept $C$’s presence; negative $\alpha$ moves it in the opposite direction.
It is important to note that $f_\alpha(x)$ may not correspond to an activation that the unmodified network would naturally produce for any input: we are intervening in activation space, possibly off the standard manifold of $f(x)$ values. Nonetheless, one can conceptually regard $f_\alpha(x)$ as the activation the network would produce if it were exposed to a version of $x$ in which concept $C$ is artificially strengthened or weakened. In practice, one could implement such modulation by injecting an appropriate bias at the layer, or by actually modifying $x$ through an input transformation that targets the concept (if such a transformation is known).
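As a minimal sketch (assuming a PyTorch model and a chosen module; the module and vector names are placeholders), the modulation operator can be realized with a forward hook that adds $\alpha\,v_C$ to the layer's output, so every downstream computation sees $f_\alpha(x)$ instead of $f(x)$.
\begin{verbatim}
import torch

def attach_modulation(layer, v_c, alpha):
    """Add alpha * v_C to the chosen layer's output on every forward pass,
    implementing f_alpha(x) = f(x) + alpha * v_C for downstream layers."""
    v_c = torch.as_tensor(v_c, dtype=torch.float32)

    def hook(module, inputs, output):
        # Broadcasts over the batch (and sequence) dimensions of the output.
        return output + alpha * v_c.to(output.device)

    return layer.register_forward_hook(hook)

# handle = attach_modulation(model.layer4, v_cat, alpha=2.5)  # placeholders
# logits = model(x)   # forward pass now computes on f(x) + 2.5 * v_cat
# handle.remove()     # restore the unmodified network
\end{verbatim}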
With these definitions in place, we can now state the theorem.
\subsection{Theorem Statement}
\newtheorem{theorem}{Theorem}
\begin{theorem}[Ontologic Scalar Modulation Theorem]
\label{thm:ontologic-scalar}
Assume a concept $C$ is linearly represented at layer $\mathcal{Z}$ of a neural network by vector $w_C$, as defined above. Then there exists a one-dimensional subspace (the span of $w_C$) in the activation space $\mathcal{Z}$ such that movement along this subspace monotonically modulates the evidence of concept $C$ in the network’s output or internal computations. In particular, for inputs $x$ where the concept is initially absent or present to a lesser degree, there is a threshold $\alpha^* > 0$ for which the network’s output $\hat{y}$ on the modulated representation $f_{\alpha}(x)$ will indicate the presence of $C$ for all $\alpha \ge \alpha^*$, under the assumption that other features remain fixed.
More formally, let $g$ be an indicator of the network’s output or classification for concept $C$ (for example, $g(f(x))=1$ if the network’s output classifies $x$ as having concept $C$, or if an internal neuron specific to $C$ fires above a threshold). Then under a local linearity assumption, there exists $\alpha^*$ such that for all $\alpha \ge \alpha^*$,
\[ g(f_{\alpha}(x)) = 1, \]
and for $\alpha \le -\alpha^*$ (sufficiently large negative modulation),
\[ g(f_{\alpha}(x)) = 0, \]
provided $\mu_C(x)$ was originally below the decision boundary for $g$.
In addition, the concept score of the modulated representation, $w_C \cdot f_{\alpha}(x)$, changes linearly with $\alpha$:
\[ w_C \cdot f_{\alpha_2}(x) - w_C \cdot f_{\alpha_1}(x) = (\alpha_2 - \alpha_1) \|w_C\|, \]
implying that the internal activation score for concept $C$ is directly proportional to the modulation parameter.
\end{theorem}
In essence, Theorem~\ref{thm:ontologic-scalar} states that if a concept can be captured by a linear direction in a network's latent space (a condition that empirical evidence suggests holds for many concepts (54, 55)), then we can treat that direction as an interpretable axis along which the concept's strength varies. Increasing the coordinate along that axis increases the network's belief in or expression of the concept, while decreasing it has the opposite effect. This allows for a continuous \emph{scalar} control of an otherwise discrete notion (the presence or absence of a concept), hence the term ``scalar modulation.''
\subsection{Proof Sketch and Discussion}
\paragraph{Proof Outline:} Under the assumptions of the theorem, $w_C$ was obtained such that $w_C \cdot f(x)$ correlates with $\mu_C(x)$. In many cases $w_C$ might be explicitly derived as the weight vector of a linear classifier $h_C(f(x)) = \sigma(w_C \cdot f(x) + b)$ trained to predict $\mathbbm{1}_C(x)$, with $\sigma$ some link function (e.g., a sigmoid for binary classification). If the concept is perfectly linearly separable at layer $\mathcal{Z}$, then there is a hyperplane $\{z : w_C \cdot z + b = 0\}$ such that $w_C \cdot f(x) + b > 0$ if and only if $\mathbbm{1}_C(x) = 1$. For simplicity, assume zero bias ($b = 0$), which can be achieved by absorbing $b$ into $w_C$ via one extra, always-on dimension appended to the representation.
Now consider an input $x$ for which $\mathbbm{1}_C(x) = 0$, i.e. concept $C$ is absent. This means $w_C \cdot f(x) < 0$ (if $x$ is on the negative side of the hyperplane). If we construct $f_\alpha(x) = f(x) + \alpha v_C$, then:
\[ w_C \cdot f_\alpha(x) = w_C \cdot f(x) + \alpha \, w_C \cdot v_C = w_C \cdot f(x) + \alpha \, \|w_C\|. \]
Because $v_C$ is the unit vector in direction $w_C$, $w_C \cdot v_C = \|w_C\|$. Thus as $\alpha$ increases, $w_C \cdot f_\alpha(x)$ increases linearly. There will be a particular value $\alpha^* = \frac{-\,w_C \cdot f(x)}{\|w_C\|}$ at which $w_C \cdot f_{\alpha^*}(x) = 0$, i.e. the modulated representation lies exactly on the decision boundary of the linear concept classifier. For any $\alpha > \alpha^*$, $w_C \cdot f_{\alpha}(x) > 0$, and thus $h_C(f_{\alpha}(x))$ will predict the concept as present (for a sufficiently large margin above the boundary, making the probability $\sigma(\cdot)$ close to 1 if using a sigmoid). This establishes the existence of a threshold beyond which the network’s classification of $x$ would be flipped with respect to concept $C$.
The monotonicity is evident from the linear relation: if $\alpha < \alpha'$, then $w_C \cdot f_{\alpha}(x) < w_C \cdot f_{\alpha'}(x)$. Therefore, if $\alpha$ is below the threshold and $\alpha'$ is above it, there is a monotonic increase in the concept score crossing the boundary, implying a change from absence to presence of the concept in the network’s output. Conversely, for negative modulation, as $\alpha$ becomes very negative, $w_C \cdot f_{\alpha}(x)$ will be strongly negative, ensuring the network firmly classifies the concept as absent.
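The threshold formula can be checked numerically with a few lines of Python on synthetic vectors (the probe weights and activation below are random stand-ins, not taken from any trained model):
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
n = 512
w_c = rng.normal(size=n)      # stand-in probe weights for concept C
f_x = rng.normal(size=n)      # stand-in activation of an input x

if w_c @ f_x > 0:             # ensure x starts on the "concept absent" side
    f_x = f_x - 2.0 * (w_c @ f_x) / (w_c @ w_c) * w_c

v_c = w_c / np.linalg.norm(w_c)
alpha_star = -(w_c @ f_x) / np.linalg.norm(w_c)

for alpha in [0.0, 0.5 * alpha_star, alpha_star, 2.0 * alpha_star]:
    score = w_c @ (f_x + alpha * v_c)
    print(f"alpha = {alpha:7.3f}   concept score = {score:8.3f}")

# The score is negative below alpha*, zero at alpha*, and positive beyond it,
# increasing by ||w_C|| per unit of alpha, as derived above.
\end{verbatim}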
One caveat is that this argument assumes the rest of the network’s processing remains appropriately “ceteris paribus” when we intervene on the representation. In reality, extremely large perturbations could move $f_\alpha(x)$ off the manifold of typical activations, leading the downstream computation to break the linear approximation. However, for sufficiently small perturbations up to the decision boundary, if we assume local linearity (which is often the case in high-dimensional spaces over short distances, especially if the next layer is linear or approximately linear in the region of interest), the network’s downstream layers will interpret $f_\alpha(x)$ in a way consistent with its movement toward a prototypical positive-$C$ representation.
Another consideration is that concept $C$ might not be perfectly represented by a single direction due to entanglement with other concepts (56). In practice, $w_C$ may capture a mixture of factors. However, if $w_C$ is the result of an optimal linear probe, it will be the direction of steepest ascent for concept log-odds at that layer. Thus moving along $w_C$ yields the greatest increase in the network’s internal evidence for $C$ per unit of change, compared to any other direction. If multiple concepts are entangled, one might apply simultaneous modulation on multiple relevant directions or choose a different layer where $C$ is more disentangled. The theorem can be generalized to a multi-dimensional subspace if needed (modulating multiple scalars corresponding to basis vectors), but we focus on the one-dimensional case for clarity.
\paragraph{Relationship to Prior Work:} The Ontologic Scalar Modulation Theorem is a theoretical generalization of several empirical observations made in prior interpretability research. For instance, in generative image models, researchers identified directions in latent space that correspond to semantic changes like “increase smile” or “turn on lights” (57). Our theorem provides a foundation for why such directions exist, assuming the generator’s intermediate feature space linearly encodes those factors. Kim \emph{et al.}’s TCAV method (58) empirically finds $v_C$ by training a probe; Theorem~\ref{thm:ontologic-scalar} assures that if the concept is learnable by a linear probe with sufficient accuracy, then moving along that probe’s weight vector will indeed modulate the concept.
It is important to note that the theorem itself does not guarantee that every high-level concept in $\Omega$ is linearly represented in $\mathcal{Z}$. Some concepts might be highly nonlinear or distributed in the representation. However, the effectiveness of linear probes in many networks (a phenomenon noted by Alain and Bengio (2016) (59), among others) suggests that deep networks often organize information in a surprisingly linearly separable way at some layer, at least for many semantically salient features. This might be related to the progressive linear separation property of deep layers, or to networks reusing features in a linear fashion for multiple tasks (as seen in multitask and transfer learning scenarios).
\section{Empirical Examples and Applications}
We now turn to concrete examples to illustrate the Ontologic Scalar Modulation Theorem in action. These examples span computer vision and natural language, and even draw parallels to neuroscience, underscoring the broad relevance of our framework.
\subsection{Controlling Visual Concepts in Generative Networks}
One vivid demonstration of concept modulation comes from generative adversarial networks (GANs). In a landmark study, \textbf{GAN Dissection}, Bau \emph{et al.} (2019) analyzed the internal neurons of a GAN trained to generate scenes (60). They found that certain neurons correspond to specific visual concepts: for example, one neuron might correspond to ``tree'' such that activating this neuron causes a tree to appear in the generated image. By intervening on that neuron's activation (setting it to a high value), the researchers could \emph{insert} the concept (a tree) into the scene (61). Conversely, suppressing the neuron could remove trees from the scene. This is an example of scalar modulation at the single-unit level.
Going beyond single units, \textbf{latent space factorization} approaches like InterfaceGAN (Shen \emph{et al.}, 2020) explicitly sought linear directions in the GAN's latent space $\mathcal{Z}$ that correlate with concepts like ``smiling'', ``age'', or ``glasses'' in generated face images. Using a set of images annotated for a concept (say, smiling vs.\ not smiling), a linear SVM was trained in $\mathcal{Z}$ to separate the two sets, yielding a normal vector $w_{\text{smile}}$. This $w_{\text{smile}}$ is exactly in line with our $w_C$ for the concept $C = $ ``smile''. The striking result is that taking any random face latent $z$ and moving it in the $w_{\text{smile}}$ direction produces a smooth transformation from a non-smiling face to a smiling face in the output image, all else held roughly constant. A conceptual figure (not reproduced here) would depict a face gradually increasing its smile as $\alpha$ (the step along $v_{\text{smile}}$) increases. This provides intuitive visual confirmation of the theorem: there is a clear axis in latent space for the concept of ``smile'', and adjusting the scalar coordinate along that axis modulates the smile in the image.
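The recipe can be sketched as follows; the annotated latents, labels, and generator \texttt{G} are placeholders, and the published InterfaceGAN pipeline differs in details such as sample filtering and conditional manipulation.
\begin{verbatim}
import numpy as np
from sklearn.svm import LinearSVC

def attribute_direction(latents, labels):
    """Fit a linear SVM in the generator's latent space and return the unit
    normal of the separating hyperplane (an InterfaceGAN-style direction)."""
    svm = LinearSVC(C=1.0, max_iter=10000).fit(latents, labels)
    w = svm.coef_[0]
    return w / np.linalg.norm(w)

def latent_walk(z, direction, alphas):
    """Modulated latent codes z + alpha * direction, one per step alpha."""
    return [z + a * direction for a in alphas]

# v_smile = attribute_direction(sampled_latents, smile_labels)  # placeholders
# frames = [G(z_mod) for z_mod in latent_walk(z, v_smile, np.linspace(-3, 3, 7))]
\end{verbatim}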
The existence of these axes has been found for numerous concepts in GANs and other generative models (62). Some are simple (color changes, lighting direction), others are high-level (adding objects like trees, changing a building’s architectural style). Not every concept is perfectly captured by one axis – sometimes moving along one direction can cause entangled changes (e.g., adding glasses might also change other facial features slightly, if those were correlated in the training data). Nonetheless, the fact that many such directions exist at all attests to a form of linear separability of semantic attributes in deep generative representations, supporting a key premise of the Ontologic Scalar Modulation Theorem.
It is also instructive to consider failure cases: when modulation along a single direction does not cleanly correspond to a concept. This usually indicates that the concept was \emph{not} purely linear in the chosen representation. For example, in GANs, ``pose'' and ``identity'' of a generated human face might be entangled; trying to change pose might inadvertently change the identity. Techniques to mitigate this include moving to a different layer's representation or applying orthogonality constraints to find disentangled directions. From the theorem's perspective, one could say that the ontology at that layer did not have ``pose'' and ``identity'' as orthogonal axes, but some rotated basis might reveal a better aligned concept axis. Indeed, methods like PCA (GANSpace) implicitly rotate the basis to find major variation directions, which often align with salient concepts.
\subsection{Concept Patching and Circuit Interpretability}
Mechanistic interpretability research on feedforward networks and transformers has embraced interventions that align with our theorem's implications. For instance, consider a transformer language model that has an internal representation of a specific factual concept, such as the knowledge of who the president of a country is. Suppose concept $C$ = ``the identity of the president of France''. This concept might be represented implicitly across several weights and activations. Work by Meng \emph{et al.} (2022) on model editing (ROME) was able to identify a specific MLP layer in GPT-type models where a factual association like (``France'' $\rightarrow$ ``Emmanuel Macron'') is stored as a key--value mapping, and by perturbing a single weight vector (essentially adding a scaled vector in that weight space), they could change the model's output on related queries (63). While this is a weight-space intervention rather than an activation-space intervention, the underlying idea is similar: there is a direction in parameter space that corresponds to the concept of ``who is President of France'', and adjusting the scalar along that direction switches the concept (to, e.g., ``Marine Le Pen'' if one hypothetically wanted to edit the knowledge incorrectly).
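The key--value picture behind such edits can be illustrated with a generic rank-one update to a linear map. This is a simplified sketch of the idea, not ROME's actual covariance-constrained objective, and all variable names are ours.
\begin{verbatim}
import numpy as np

def rank_one_edit(W, key, new_value):
    """Return W plus the rank-one update that maps `key` exactly to
    `new_value`, while leaving directions orthogonal to `key` unchanged."""
    key = np.asarray(key, dtype=np.float64)
    residual = new_value - W @ key                 # what the layer currently gets wrong
    delta = np.outer(residual, key) / (key @ key)  # rank-one correction
    return W + delta

# After the edit, (W + delta) @ key equals new_value up to floating point,
# so a stored key-value association can be rewritten by one such step.
\end{verbatim}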
At the activation level, one can apply \emph{concept patching}. Suppose we have two sentences: $x_1$ = ``The \textbf{red apple} is on the table.'' and $x_2$ = ``The \textbf{green apple} is on the table.'' If we take $C$ to be the concept of ``red'' color, we can take the representation from $x_1$ at a certain layer and transplant it into $x_2$'s representation at the same layer, specifically for the position corresponding to the color attribute. This is a form of setting $\alpha$ such that we replace ``green'' with ``red'' in latent space. Indeed, empirical techniques show that if one swaps the appropriate neuron activations (the ones encoding the color in that context), the model's output (e.g., an image generated or a completion) will switch the color from green to red, leaving other words intact (64). This is essentially moving along a concept axis in a localized subset of the network (those neurons responsible for color).
These targeted interventions often leverage knowledge of the network’s *circuits*: small networks of neurons that together implement some sub-function. When a concept is represented not by a single direction but by a combination of activations, one might modulate multiple scalars jointly. Nonetheless, each scalar corresponds to one basis vector of variation, which could be seen as multiple one-dimensional modulations done in concert. For example, a circuit for detecting “negative sentiment” in a language model might involve several neurons; toggling each from off to on might convert a sentence’s inferred sentiment. In practice, one might find this circuit via causal experiments and then modulate it. The theorem can be conceptually extended to the multi-dimensional case: a low-dimensional subspace $W \subset \mathcal{Z}$ (spanned by a few vectors $w_{C_1}, ..., w_{C_k}$) such that movement in that subspace changes a set of related concepts $C_1,...,C_k$. This could handle cases like a concept that naturally breaks into finer sub-concepts.
\subsection{Neuroscience Analogies}
It is worth reflecting on how the Ontologic Scalar Modulation Theorem relates to what is known about brain representations. In neuroscience, the discovery of neurons that respond to highly specific concepts – such as the so-called ``Jennifer Aniston neuron'' that fires to pictures of Jennifer Aniston and even the text of her name (65) – suggests that the brain too has identifiable units (or ensembles) corresponding to high-level semantics. These neurons are often called \emph{concept cells} (66). The existence of concept cells aligns with the idea that at some level of processing, the brain achieves a disentangled, or at least explicit, representation of certain entities or ideas. The mechanisms by which the brain could \emph{tune} these cells (increase or decrease their firing) parallel our notion of scalar modulation. For instance, attention mechanisms in the brain might effectively modulate certain neural populations, increasing their activity and thereby making a concept more salient in one's cognition.
Recent work using brain-computer interfaces has demonstrated volitional control of individual neurons: in macaque monkeys, researchers have provided real-time feedback to the animal from a single neuron’s firing rate and shown that animals can learn to control that firing rate (essentially adjusting a scalar activation of a targeted neuron). If that neuron’s firing corresponds to a concept or action, the animal is indirectly modulating that concept in its brain. This is a speculative connection, but it illustrates the broad relevance of understanding how concept representations can be navigated in any intelligent system, biological or artificial.
On a higher level, our theorem is an attempt to formalize something like a ``neural key'' for a concept – akin to how one might think of a grandmother cell (a neuron that represents one's grandmother) that can be turned on or off. While modern neuroscience leans towards distributed representations (a given concept is encoded by a pattern across many neurons), there may still be principal components or axes in neural activity space that correspond to coherent variations (e.g., an ``animal vs.\ non-animal'' axis in visual cortex responses). Indeed, techniques analogous to PCA applied to population neural data sometimes reveal meaningful axes (like movement direction in motor cortex). The mathematics of representational geometry is a common thread between interpreting networks and brains (67, 68).
\section{Discussion}
The Ontologic Scalar Modulation Theorem opens several avenues for deeper discussion, both practical and philosophical. We discuss the implications for interpretability research, the limitations of the theorem, and how our work interfaces with broader questions in AI and cognitive science.
\subsection{Implications for AI Safety and Interpretability}
Understanding and controlling concepts in neural networks is crucial for AI safety. One major risk with black-box models is that they might latch onto spurious or undesired internal representations that could lead to errant behavior. By identifying concept vectors, we can audit what concepts a model has internally learned. For example, one might discover a ``race'' concept in a face recognition system's latent space and monitor or constrain its use to prevent biased decisions. The ability to modulate concepts also allows for \emph{counterfactual testing}: ``What would the model do if this concept were present/absent?'' This is effectively what our $\alpha$ parameter adjustment achieves. Such counterfactuals help in attributing causality to internal features (69, 70).
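A minimal sketch of such a counterfactual audit is shown below; it assumes one already has a layer representation, a candidate concept vector, and a callable classification head, all of which are placeholders here.
\begin{verbatim}
import numpy as np

def counterfactual_sweep(f_x, v_c, head, alphas):
    """Record the head's prediction on f(x) + alpha * v_C for each alpha.
    `head` is any callable mapping a representation to a predicted label."""
    return [(float(alpha), head(f_x + alpha * v_c)) for alpha in alphas]

# results = counterfactual_sweep(f_x, v_concept, head, np.linspace(-3, 3, 13))
# A prediction that flips as alpha varies is evidence that the audited
# concept causally influences the decision at this layer.
\end{verbatim}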
Our theorem, being a formal statement, suggests the possibility of \emph{guarantees} under certain conditions. In safety-critical systems, one might want guarantees that, no matter the input, the internal representation cannot express certain forbidden concepts (for instance, a military AI that should never represent civilians as targets). If those concepts can be characterized by vectors, one could attempt to null out those directions, projecting activations onto their orthogonal complement (in effect holding $\alpha = 0$), and ensure the network does not drift along those axes. This is speculative and challenging (what if the concept is not perfectly linear?), but it illustrates how identifying an ontology can lead to enforceable constraints.
Moreover, interpretability methods often suffer from the criticism of being ``fragmentary'': one can analyze one neuron or one circuit, but it is hard to get a global picture. An ontology-level view provides a structured summary: a list of concepts and relations the model uses internally. This is akin to reverse-engineering a symbolic program from a trained neural network. If successful, it could bridge the gap between sub-symbolic learning and symbolic reasoning systems, allowing us to extract, for example, a logical rule or a decision tree that approximates the network's reasoning in terms of these concepts. In fact, there is ongoing research in \emph{neuro-symbolic} systems where neural nets interface with explicit symbolic components; our findings could inform better integrations by telling us what symbols the nets are implicitly working with.
\subsection{Limitations and Complexity of Reality}
While the theorem provides a neat picture, reality is more complex. Not all concepts are cleanly separable by a single hyperplane in a given layer’s representation. Many useful abstractions might only emerge in a highly nonlinear way, or be distributed such that no single direction suffices. In such cases, one might need to consider non-linear modulation (perhaps quadratic effects or higher) or find a new representation (maybe by adding an auxiliary network that makes the concept explicit). Our theorem could be extended with additional conditions to handle these scenarios, but at some cost of simplicity.
Additionally, the presence of \textbf{superposition} in neural networks – where multiple unrelated features are entangled in the same neurons due to limited dimensionality or regularization (71, 72) – can violate the assumption of linear separability. Recent work by Elhage \emph{et al.} (2022b) studied ``toy models of superposition'', showing that when there are more features to represent than neurons available, the network will store features in a compressed, entangled form (73). In such cases, $w_C$ might pick up on not only concept $C$ but also pieces of other concepts. One potential solution is to increase dimensionality or encourage sparsity (74) so that features disentangle, which some interpretability researchers have indeed been exploring (75). The theorem might then apply piecewise in different regions of activation space where different features dominate.
From a technical standpoint, another limitation is that we assumed a known concept $C$ with an indicator $\mathbbm{1}_C(x)$. In unsupervised settings, we might not know what concepts the model has learned; discovering $\Omega$ (the ontology) itself is a challenge. Methods like clustering activation vectors, or finding extreme activations and visualizing them, are used to hypothesize concepts. Our framework could potentially be turned around to \emph{define} a concept by a direction: if an unknown direction $v$ consistently yields a certain pattern in outputs when modulated, we might assign it a meaning. For example, one could scan through random directions in the latent space of a GAN and see what changes occur, thereby discovering a concept like ``add clouds in the sky'' for some direction, as sketched below. Automating this discovery remains an open problem, but our theorem provides a way to verify and quantify a discovered concept axis once a candidate is found.
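One simple (admittedly brute-force) way to begin such a search is the following loop: sample random unit directions in latent space and render image pairs at opposite ends of each direction for human (or automated) inspection. The generator \texttt{G} and all parameters are placeholders.
\begin{verbatim}
import numpy as np

def scan_random_directions(G, z, n_directions=32, alpha=3.0, seed=0):
    """For random unit directions d, return (d, G(z - alpha*d), G(z + alpha*d))
    so an inspector can judge what, if anything, each direction changes."""
    rng = np.random.default_rng(seed)
    candidates = []
    for _ in range(n_directions):
        d = rng.normal(size=z.shape)
        d /= np.linalg.norm(d)
        candidates.append((d, G(z - alpha * d), G(z + alpha * d)))
    return candidates
\end{verbatim}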
\subsection{Philosophical Reflections: Symbols and Understanding}
Finally, we circle back to the philosophy of mind. One might ask: does the existence of a concept vector mean the network \emph{understands} that concept? In the strong sense, probably not; understanding involves a host of other capacities (such as using the concept appropriately in varied contexts, explaining it, etc.). However, it does indicate the network has a \emph{representation} of the concept in a way that is isomorphic (structurally similar) to how one might represent it symbolically. Searle's Chinese Room argument (1980) posits that a system could manipulate symbols without understanding them. Here, the network did not even have explicit symbols, yet we as observers can \emph{attribute} symbolic meaning to certain internal vectors. Whether the network ``knows'' the concept is a matter of definition, but it at least has a handle to turn that correlates with the concept in the world. This touches on the \emph{symbol grounding problem} (Harnad, 1990): how do internal symbols get their meaning? In neural nets, the ``meaning'' of a hidden vector is grounded in how it affects outputs in relation to inputs. If moving along $v_C$ changes the output in a way humans interpret as ``more $C$'', that hidden vector's meaning is grounded by that causal role. Our work thus contributes to an operational solution to symbol grounding in AI systems: a concept is grounded by the set of inputs and outputs it governs when that internal representation is activated or modulated (76).
In the context of Kantian philosophy, one could muse that these networks, through training, develop a posteriori analogues of Kant's a priori categories. They are not innate but learned through exposure to data; yet once learned, they function as a lens through which the network ``perceives'' inputs. A network with a concept vector for ``edible'' vs.\ ``inedible'' might, after training on a survival task, literally see the world of inputs divided along that categorical line in its latent space. Philosophy aside, this could be tested by checking whether such a vector exists and influences behavior.
Lastly, our interdisciplinary narrative underscores a convergence: ideas from 18th-century philosophy, 20th-century cognitive science, and 21st-century deep learning are aligning around the existence of \emph{structured, manipulable representations} as the cornerstone of intelligence. Plato's Forms might have been metaphysical, but in a neural network one can argue there is a ``form'' of a cat: not a physical cat, but an abstract cat-essence vector that the network uses. The fact that independent networks trained on different data sometimes find remarkably similar vectors (e.g., vision networks finding similar edge detectors (77), or language models converging on similar syntax neurons) gives a modern twist to the notion of universals.
\section{Conclusion}
In this work, we have expanded the ``Ontologic Scalar Modulation Theorem'' into a comprehensive framework linking the mathematics of neural network representations with the semantics of human-understandable concepts. By grounding our discussion in mechanistic interpretability and drawing on interdisciplinary insights from cognitive science and philosophy, we provided both a formal theorem and a rich contextual interpretation of its significance. The theorem itself formalizes how a neural network's internal \emph{ontology}, the set of concepts it represents, can be probed and controlled via linear directions in latent space. Empirically, we illustrated this with examples from state-of-the-art models, showing that even complex concepts often correspond to understandable transformations in activation space.
Our treatment also highlighted the historical continuity of these ideas: we saw echoes of Kant's categories and Peirce's semiotics in the way networks structure information, and we related the learned latent ontologies in AI to longstanding philosophical debates about the nature of concepts and understanding. These connections are more than mere analogies; they suggest that as AI systems grow more sophisticated, the tools to interpret them may increasingly draw from, and even contribute to, the philosophy of mind and knowledge.
There are several promising directions for future work. On the theoretical side, relaxing the assumptions of linearity and extending the theorem to more complex (nonlinear or multi-dimensional) concept representations would broaden its applicability. We also aim to investigate automated ways of extracting a network’s full ontology, essentially building a taxonomy of all significant $v_C$ concept vectors a model uses, and verifying their interactions. On the applied side, integrating concept modulation techniques into model training could lead to networks that are inherently more interpretable, by design (for instance, encouraging disentangled, modulatable representations as part of the loss function). There is also a tantalizing possibility of using these methods to facilitate human-AI communication: if a robot can internally represent “hunger” or “goal X” along a vector, a human operator might directly manipulate that representation to communicate instructions or feedback.
In conclusion, the Ontologic Scalar Modulation Theorem serves as a bridge between the low-level world of neurons and weights and the high-level world of ideas and meanings. By traversing this bridge, we take a step towards AI systems whose workings we can comprehend in the same way we reason about programs or symbolic knowledge, a step towards AI that is not just intelligent but also \emph{intelligible}. We believe this line of research will not only improve our ability to debug and align AI systems, but also enrich our scientific understanding of representation and abstraction, concepts that lie at the heart of both artificial and natural intelligence.
\begin{thebibliography}{99}
\bibitem{Bereska2024} Bereska, L., \& Gavves, E. (2024). \textit{Mechanistic Interpretability for AI Safety: A Review}. arXiv:2404.14082.
\bibitem{Elhage2021} Elhage, N., Nanda, N., Olsson, C., Henighan, T., Joseph, N., et al. (2021). \textit{A Mathematical Framework for Transformer Circuits}. Transformer Circuits Thread.
\bibitem{Bau2017} Bau, D., Zhou, B., Khosla, A., Oliva, A., \& Torralba, A. (2017). \textit{Network Dissection: Quantifying Interpretability of Deep Visual Representations}. In \textit{Proc. CVPR}.
\bibitem{Bau2019} Bau, D., Zhu, J.-Y., Strobelt, H., Tenenbaum, J., Freeman, W., \& Torralba, A. (2019). \textit{GAN Dissection: Visualizing and Understanding Generative Adversarial Networks}. In \textit{Proc. ICLR}.
\bibitem{Kim2018} Kim, B., Wattenberg, M., Gilmer, J., Cai, C., Wexler, J., Viegas, F., \& Sayres, R. (2018). \textit{Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV)}. In \textit{Proc. ICML}.
\bibitem{Goh2021} Goh, G., et al. (2021). \textit{Multimodal Neurons in Artificial Neural Networks}. Distill.
\bibitem{Locatello2019} Locatello, F., Bauer, S., Lucic, M., Raetsch, G., Gelly, S., Schölkopf, B., \& Bachem, O. (2019). \textit{Challenging Common Assumptions in the Unsupervised Learning of Disentangled Representations}. In \textit{Proc. ICML}.
\bibitem{Newell1976} Newell, A., \& Simon, H. (1976). \textit{Computer Science as Empirical Inquiry: Symbols and Search}. \textit{Communications of the ACM, 19}(3), 113–126.
\bibitem{Fodor1988} Fodor, J. A., \& Pylyshyn, Z. W. (1988). \textit{Connectionism and Cognitive Architecture: A Critical Analysis}. \textit{Cognition, 28}(1-2), 3–71.
\bibitem{Harnad1990} Harnad, S. (1990). \textit{The Symbol Grounding Problem}. \textit{Physica D, 42}(1-3), 335–346.
\bibitem{Kant1781} Kant, I. (1781). \textit{Critique of Pure Reason}. (Various translations).
\bibitem{PlatoRepublic} Plato. (c. 380 BC). \textit{The Republic}. (Trans. Allan Bloom, 1968, Basic Books).
\bibitem{Peirce1867} Peirce, C. S. (1867). \textit{On a New List of Categories}. \textit{Proceedings of the American Academy of Arts and Sciences, 7}, 287–298.
\bibitem{Marr1982} Marr, D. (1982). \textit{Vision: A Computational Investigation into the Human Representation and Processing of Visual Information}. W. H. Freeman and Company.
\bibitem{Quiroga2012} Quian Quiroga, R. (2012). \textit{Concept cells: the building blocks of declarative memory functions}. \textit{Nature Reviews Neuroscience, 13}(8), 587–597.
\bibitem{Miller1995} Miller, G. A. (1995). \textit{WordNet: A Lexical Database for English}. \textit{Communications of the ACM, 38}(11), 39–41.
\bibitem{Lenat1995} Lenat, D. B. (1995). \textit{CYC: A Large-Scale Investment in Knowledge Infrastructure}. \textit{Communications of the ACM, 38}(11), 33–38.
\end{thebibliography}
\end{document}