\documentclass[11pt]{article}
% --- Packages ---
\usepackage[a4paper,margin=1in]{geometry}
\usepackage{amsmath,amssymb,amsfonts,bm,amsthm,mathtools}
\usepackage{microtype}
\usepackage{hyperref}
\usepackage{enumitem}
\usepackage{tcolorbox}
\usepackage{tikz}
\usetikzlibrary{arrows.meta,positioning}
\usepackage{subcaption}
\usepackage{cleveref}
\hypersetup{
colorlinks=true,
linkcolor=blue!50!black,
urlcolor=blue!50!black,
citecolor=blue!50!black
}
% --- Operators & Macros ---
\DeclareMathOperator*{\argmax}{arg\,max}
\DeclareMathOperator*{\argmin}{arg\,min}
\DeclareMathOperator{\KL}{KL}
\newcommand{\E}{\mathbb{E}}
\newcommand{\given}{\,\middle|\,}
\newcommand{\1}[1]{\mathbf{1}\{#1\}}
\newcommand{\Hb}{H_b} % binary entropy shorthand
% --- Theorem environments ---
\newtheorem{proposition}{Proposition}[section]
\newtheorem{definition}{Definition}[section]
% --- Layperson box ---
\newtcolorbox{layperson}{
colback=blue!5!white,
colframe=blue!75!black,
title=\textbf{In Plain Language},
fonttitle=\bfseries
}
% --- Title ---
\title{An Operational Framework for Recursive Information Accumulation Under Observation Protocols}
\author{C.~L.~Vaillant}
\date{October 1, 2025}
\begin{document}
\maketitle
\begin{abstract}
We present an \emph{operational framework} that interprets the chain rule of mutual information as a principle of sequential learning under observation protocols. Mathematically, the core identity is standard; the novelty is the \emph{interpretation and unification}: protocols induce expanding filtrations of distinctions while posteriors contract their support, making stepwise information gain additive and monotonic. We show how classical results in Bayesian inference, experimental design, rate--distortion, functional information, and variational free energy can be viewed through this lens. Worked examples (Gaussian estimation with diminishing returns, binary search with constant returns, adaptive protocol choice, and soft constraints) illustrate the framework in action. We clarify scope (static vs.\ dynamic states, costs, measurability) and where connections are formal versus interpretive. The result is a pedagogical and conceptual synthesis aimed at researchers across machine learning, information theory, neuroscience, and biology.
\end{abstract}
%---------------- FIGURE (subfigures) ----------------%
\begin{figure}[t]
\centering
\begin{subfigure}[t]{0.45\textwidth}
\centering
\begin{tikzpicture}[scale=0.9, every node/.style={font=\small}]
\draw[thick] (0,0) circle [x radius=2.7, y radius=1.6];
\draw[thick] (0,0) circle [x radius=2.1, y radius=1.25];
\draw[thick] (0,0) circle [x radius=1.5, y radius=0.9];
\node at (0,1.85) {$\mathcal{F}_{C_2}$};
\node at (0,1.45) {$\mathcal{F}_{C_1}$};
\node at (0,1.05) {$\mathcal{F}_{C_0}$};
\draw[->] (-2.9,-1.9) -- (3.2,-1.9) node[below] {$t$};
\draw[fill] (-2.0,-1.9) circle(1.2pt) node[below] {$0$};
\draw[fill] (0.0,-1.9) circle(1.2pt) node[below] {$1$};
\draw[fill] (2.0,-1.9) circle(1.2pt) node[below] {$2$};
\end{tikzpicture}
\caption{Filtration growth: protocols expand admissible distinctions ($\mathcal{F}_{C_0}\subset \mathcal{F}_{C_1}\subset \mathcal{F}_{C_2}$).}
\end{subfigure}\hfill
%
\begin{subfigure}[t]{0.45\textwidth}
\centering
\begin{tikzpicture}[scale=0.9, every node/.style={font=\small}]
\draw[->] (0,-2.0) -- (5.0,-2.0) node[below] {$x$};
\draw[->] (0,-2.0) -- (0,2.0) node[left] {$p_t(x)$};
% Three Gaussians narrowing
\draw[thick] plot[smooth,domain=0.2:4.6] ({\x},{1.2*exp(-((\x-2.5)^2)/1.6)-2.0});
\draw[thick] plot[smooth,domain=0.4:4.4] ({\x},{1.7*exp(-((\x-2.5)^2)/0.9)-2.0});
\draw[thick] plot[smooth,domain=0.6:4.2] ({\x},{2.2*exp(-((\x-2.5)^2)/0.5)-2.0});
\node at (2.5,-1.2) {$t=0,1,2$};
\end{tikzpicture}
\caption{Posterior contraction under informative protocols.}
\end{subfigure}
\vspace{1em}
\begin{subfigure}[t]{0.9\textwidth}
\centering
\begin{tikzpicture}[scale=0.9, every node/.style={font=\small}]
\draw[->] (0,0) -- (6.6,0) node[below] {$T$};
\draw[->] (0,0) -- (0,3.0) node[left] {$I_T$};
% Linear (binary search)
\draw[thick] (0,0) -- (6.2,2.5) node[pos=0.9,above] {binary search ($\propto T$)};
% Log-like (Gaussian)
\draw[thick,domain=0.1:6.2,smooth] plot (\x,{1.1*ln(1+1.2*\x)/ln(2)});
\node at (3.5,1.2) {Gaussian ($\propto \log T$)};
\end{tikzpicture}
\caption{Information accumulation: linear (binary search) vs.\ logarithmic (Gaussian estimation).}
\end{subfigure}
\caption{Three complementary views of recursive information gain.}
\label{fig:filtration_contraction}
\end{figure}
%-----------------------------------------------------%
\section{Introduction}
Learning can be viewed as the recursive narrowing of a possibility space under structured observations. Each observation protocol carves distinctions, eliminating incompatible states while refining probability mass over those that remain. This work introduces a unifying framework governing such recursive information processes, situating classical results across disciplines within a single lens.
\begin{layperson}
Imagine ``20 Questions'': each question carves away many options. Our claim is not a new theorem; it's a clear way to see \emph{existing} theorems as one story about how observations add up and shrink uncertainty.
\end{layperson}
\section{Preliminaries and Notation}
\subsection{Protocols and Filtrations}
\begin{definition}[Protocol and Filtration]
A \emph{protocol} at step $t$, denoted $C_t$, specifies a channel $p(y_t\mid x_t,C_t)$ mapping latent $X_t$ to an observation $Y_t$. Each $C_t$ induces a $\sigma$-algebra $\mathcal{F}_{C_t}$ of admissible distinctions over the latent space. A sequence $(C_t)_{t=0}^T$ generates a filtration $\mathcal{F}_{C_0}\subseteq \cdots \subseteq \mathcal{F}_{C_T}$.
\end{definition}
Let $(X_t)_{t=0}^T$ be latent states and $(Y_t)_{t=1}^T$ observations. Distinct protocols may induce distinct observation spaces; conditioning is always on realized observations $Y_{1:t-1}$.
\subsection{Chain Rule of Mutual Information}
\begin{proposition}[Chain Rule (static latent)]\label{prop:chain}
For a static latent $X_0$ and observations $Y_{1:T}$ under any protocol sequence $(C_t)$,
\begin{equation}\label{eq:chain}
I(X_0; Y_{1:T}) = \sum_{t=1}^T I(X_0; Y_t \mid Y_{1:t-1}).
\end{equation}
\end{proposition}
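As a numerical sanity check on \eqref{eq:chain}, the following stdlib-only Python sketch (the function name and toy model are ours, purely illustrative) verifies the chain rule on a four-state latent with two noiseless one-bit protocols:

```python
import math

def chain_rule_check():
    # Toy static model: X uniform on {0,1,2,3}; protocol C_1 reveals the
    # high bit (Y1), protocol C_2 the low bit (Y2).  Verifies
    #   I(X; Y1,Y2) = I(X; Y1) + I(X; Y2 | Y1).
    joint = {(x, x >> 1, x & 1): 0.25 for x in range(4)}  # keys: (x, y1, y2)

    def H(keyfn):
        # Entropy in bits of the marginal obtained by grouping outcomes.
        marg = {}
        for k, p in joint.items():
            marg[keyfn(k)] = marg.get(keyfn(k), 0.0) + p
        return -sum(p * math.log2(p) for p in marg.values() if p > 0)

    I_total = H(lambda k: k[0]) + H(lambda k: k[1:]) - H(lambda k: k)
    I_step1 = H(lambda k: k[0]) + H(lambda k: k[1]) - H(lambda k: k[:2])
    # I(X; Y2 | Y1) = H(X | Y1) - H(X | Y1, Y2), each term via joint entropies.
    I_step2 = (H(lambda k: k[:2]) - H(lambda k: k[1])) \
              - (H(lambda k: k) - H(lambda k: k[1:]))
    return I_total, I_step1, I_step2
```

Here each step yields exactly one bit, and the two increments sum to the total mutual information of two bits.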
\section{Framework and Consequences}
\begin{proposition}[Operational Framework (static)]\label{prop:framework-static}
For any sequence of observation protocols $(C_t)_{t=1}^T$,
\begin{equation}\label{eq:framework}
I_T \coloneqq H(X_0) - H(X_0 \mid Y_{1:T})
= \sum_{t=1}^T I(X_0; Y_t \mid Y_{1:t-1}) \;\geq 0.
\end{equation}
Moreover,
\[
\Delta I_t = I(X_0;Y_t\mid Y_{1:t-1})
= \E_{y_{1:t}}\!\left[\KL\!\big(p(x_0\mid y_{1:t}) \,\|\, p(x_0\mid y_{1:t-1})\big)\right]\!\ge 0.
\]
\end{proposition}
\begin{proposition}[Operational Framework (dynamic trajectory)]\label{prop:framework-dyn}
If $(X_t)$ evolves via $p(x_t\mid x_{t-1})$, then
\begin{equation}\label{eq:framework-dyn}
I(X_{0:T};Y_{1:T})
= \sum_{t=1}^T I(X_{0:t}; Y_t \mid Y_{1:t-1})
\;\ge 0,
\end{equation}
with increments
\[
\Delta I_t = I(X_{0:t};Y_t\mid Y_{1:t-1})
= \E_{y_{1:t}}\!\left[\KL\!\big(p(x_{0:t}\mid y_{1:t}) \,\|\, p(x_{0:t}\mid y_{1:t-1})\big)\right]\!\ge 0.
\]
\end{proposition}
\noindent The filtration expands by construction ($\mathcal{F}_{C_t}\supseteq \mathcal{F}_{C_{t-1}}$), while the posterior contracts under \emph{informative} protocols, i.e., whenever $\Delta I_t>0$.
\section{Scope and Assumptions}\label{sec:scope}
\paragraph{State dynamics.} We accommodate both static latents ($X_t\equiv X_0$) and dynamic latents with transition kernel $p(x_t\mid x_{t-1})$; in the latter case, work with the trajectory $X_{0:T}$.
\paragraph{Measurability.} When distinct protocols induce distinct observation spaces, we work on the product $\sigma$-algebra; conditioning is always on realized observations $Y_{1:t-1}$.
\paragraph{Costs.} Protocol selection (\cref{sec:selection}) incorporates resource costs via $\mathrm{Cost}(C)$; the chain decomposition \eqref{eq:framework} itself is unaffected.
\paragraph{Stationarity.} Asymptotic results (\cref{sec:asymptotics}) assume stationarity and/or ergodicity; the core framework does not.
\section{Unified Update Rule}\label{sec:update}
The posterior evolves via Bayesian reweighting:
\begin{equation}\label{eq:update}
p_{t+1}(x) = \frac{p_t(x)\,w_{C_t}(x)}{Z_t}, \qquad
Z_t = \E_{p_t}[w_{C_t}(X)],
\end{equation}
where $w_{C_t}(x)$ is the likelihood or weight assigned by protocol $C_t$ to state $x$.
\paragraph{Hard constraints:} $w_{C_t}(x)\in\{0,1\}$ (states are either ruled out or remain possible).
\paragraph{Soft constraints:} $w_{C_t}(x)\in [0,\infty)$ (states are continuously reweighted).
\begin{layperson}
If you observe something very likely under hypothesis $x$, you increase its weight. Hard constraints are like perfect clues (``the suspect was in Paris''); soft constraints are like probabilistic evidence (``70\% chance of rain'').
\end{layperson}
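A minimal Python sketch of the update \eqref{eq:update} over a finite state space (the helper name \texttt{bayes\_step} is ours) makes the hard/soft distinction concrete:

```python
def bayes_step(p, w):
    """One recursive update p_{t+1}(x) = p_t(x) w(x) / Z_t.

    p : dict mapping state -> prior probability p_t(x)
    w : dict mapping state -> protocol weight w_{C_t}(x)
        (hard constraint: values in {0,1}; soft: any nonnegative value)
    Returns the normalized posterior; Z is the evidence E_{p_t}[w(X)].
    """
    Z = sum(p[x] * w[x] for x in p)
    if Z == 0:
        raise ValueError("protocol rules out every state with positive mass")
    return {x: p[x] * w[x] / Z for x in p}
```

With a 0/1 weight vector this reproduces hard elimination (surviving states are renormalized uniformly over the feasible set); with graded weights it performs ordinary Bayesian reweighting.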
\section{Active Protocol Selection}\label{sec:selection}
When agents can \emph{choose} protocols, the optimal choice balances information gain against cost:
\begin{equation}\label{eq:selection}
C_t^\star = \argmax_{C \in \mathcal{U}} \Big\{ I(X_t; Y_t \mid Y_{1:t-1},C) - \lambda \,\mathrm{Cost}(C) \Big\}.
\end{equation}
This principle underlies active learning~\cite{mackay1992}, optimal experimental design~\cite{lindley1956}, and efficient foraging in biology.
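A one-step greedy version of \eqref{eq:selection} can be sketched as follows, assuming a finite menu of candidate channels given as likelihood tables (all function names here are ours, and this is a sketch rather than a full sequential planner):

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits of a collection of probabilities."""
    return -sum(q * math.log2(q) for q in probs if q > 0)

def expected_info_gain(prior, channel):
    """I(X;Y | C) for prior p(x) and channel p(y|x,C), both given as dicts."""
    ys = set(y for x in channel for y in channel[x])
    p_y = {y: sum(prior[x] * channel[x].get(y, 0.0) for x in prior) for y in ys}
    H_y = entropy_bits(p_y.values())
    H_y_given_x = sum(prior[x] * entropy_bits(channel[x].values()) for x in prior)
    return H_y - H_y_given_x          # I(X;Y) = H(Y) - H(Y|X)

def select_protocol(prior, protocols, costs, lam=0.1):
    """Greedy argmax of expected info gain minus lambda * cost."""
    return max(protocols,
               key=lambda C: expected_info_gain(prior, protocols[C]) - lam * costs[C])
```

At equal cost, a noiseless binary channel (one full bit of gain) dominates a noisy one, as expected.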
\section{Interpretive Mappings: Scope and Limits}\label{sec:embodiments}
\subsection{Functional Information under a Perfect Test Protocol}
Let $\mathcal{X}$ be finite with uniform prior $p_0$. Define the feasible set $A_E=\{x: f(x)\ge E\}$. If we model the protocol as a \emph{noiseless threshold test} that perfectly distinguishes $A_E$ from its complement (i.e., $Y_1=\1{x\in A_E}$ with no errors), then the one-step update yields $p_1(x)\propto \1{x\in A_E}$ (uniform on $A_E$), and
\begin{equation}\label{eq:funcinfo}
\Delta I_1 = H(X)-H(X\mid Y_1)
= \log_2 |\mathcal{X}| - \log_2 |A_E|
= -\log_2 \frac{|A_E|}{|\mathcal{X}|}
= I_f(E).
\end{equation}
Thus, under a perfect test protocol, functional information equals the information gain from a single hard constraint, consistent with \cite{hazen2007,walker2016}. With noise or a non-uniform prior, the analogous $\Delta I_1$ equals the KL divergence between prior and renormalized feasible posterior (not exactly $I_f$).
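Under the perfect-test assumption, \eqref{eq:funcinfo} reduces to a one-line computation (the function name is ours):

```python
import math

def functional_information(n_total, n_feasible):
    """I_f(E) = -log2(|A_E| / |X|): the information gain, in bits, of a
    noiseless threshold test under a uniform prior on a finite state space."""
    return -math.log2(n_feasible / n_total)
```

For instance, if 16 of 1024 states meet the functional threshold, the perfect test yields $\log_2 1024 - \log_2 16 = 6$ bits.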
\subsection{Rate--Distortion as Protocol Design}
Model a protocol as $p(z\mid y,C)$ mapping detailed reality $Y$ to representation $Z$. Shannon's rate--distortion problem~\cite{shannon1959}
\begin{equation}\label{eq:ratedist}
R(D) = \min_{p(z|y)\,:\,\mathbb{E}[d(Y,Z)]\le D} I(Y;Z)
\end{equation}
\emph{minimizes} information subject to a fidelity constraint. Our framework, by contrast, \emph{takes} a protocol and \emph{computes} its information yield (or selects among protocols via \eqref{eq:selection}). These are dual perspectives: rate--distortion is constrained \emph{protocol design}, while \eqref{eq:framework} is \emph{accounting} given a protocol.
\subsection{Variational Free Energy (Related, Not a Special Case)}
The variational free energy~\cite{friston2010}
\begin{equation}\label{eq:freeenergy}
F = \mathbb{E}_q[\log q(x) - \log p(x,Y)] = D_{\mathrm{KL}}(q \| p(\cdot \mid Y)) - \log p(Y)
\end{equation}
minimizes an upper bound on $-\log p(Y)$ by choosing approximate posterior $q$ (and often model parameters). While conceptually aligned---maximize fit to data, minimize residual uncertainty---free energy optimizes over \emph{inference procedures}, not fixed protocols $C_t$. We treat it as a closely related objective rather than a strict specialization of \eqref{eq:framework}.
\subsection{Relation to Lindley Information}
Lindley (1956) introduced the expected information gain from an experiment~\cite{lindley1956}:
\begin{equation}\label{eq:lindley}
I(X;Y) = H(X) - \mathbb{E}_Y[H(X\mid Y)].

\end{equation}
Our contribution extends this to \emph{sequential} protocols with explicit filtration growth versus posterior contraction, and shows how the chain rule decomposes cumulative learning.
\section{Worked Examples}\label{sec:examples}
\subsection{Gaussian Mean Estimation (Diminishing Returns)}
\textbf{Setup:} True mean $\mu\sim\mathcal{N}(0,v_0)$, observations $y_t=\mu+\epsilon_t$ with $\epsilon_t\sim\mathcal{N}(0,\sigma^2)$.
\textbf{Posterior evolution:}
\begin{align}
p_t(\mu) &= \mathcal{N}(\mu \mid m_t, v_t), \\
v_t^{-1} &= v_0^{-1} + t\sigma^{-2}, \\
m_t &= v_t \sigma^{-2} \sum_{i=1}^t y_i.
\end{align}
\textbf{Information gain per step:}
\[
\Delta I_t = I(\mu; Y_t \mid Y_{1:t-1}) = \tfrac{1}{2}\log_2\!\left(1+ \frac{v_{t-1}}{\sigma^2}\right).
\]
Since $v_t \to 0$ as $t\to\infty$, we have $\Delta I_t\to 0$: \emph{diminishing returns}.
\textbf{Cumulative information (exact for this model):}
\[
I_T = \tfrac{1}{2}\log_2\!\left(1+ \frac{v_0 T}{\sigma^2}\right) \approx \tfrac{1}{2}\log_2 T \quad \text{for large } T.
\]
\begin{layperson}
The first few measurements are highly informative (``Is it 10m or 100m tall?''), but after 100 measurements, one more barely changes your estimate. Information grows like $\log T$—doubling measurements adds a constant amount, not double.
\end{layperson}
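The per-step and cumulative formulas above can be cross-checked numerically; the sketch below (stdlib-only, function name ours) iterates the precision recursion $v_t^{-1} = v_0^{-1} + t\sigma^{-2}$ and confirms that the stepwise gains telescope to $\tfrac12\log_2(1+v_0T/\sigma^2)$:

```python
import math

def gaussian_info_gains(v0, sigma2, T):
    """Per-step gains dI_t = 0.5*log2(1 + v_{t-1}/sigma^2) for Gaussian
    mean estimation, where v_t^{-1} = v0^{-1} + t/sigma^2."""
    gains = []
    v = v0                                  # v_{t-1}, starting at the prior variance
    for t in range(1, T + 1):
        gains.append(0.5 * math.log2(1 + v / sigma2))
        v = 1.0 / (1.0 / v0 + t / sigma2)   # posterior variance v_t
    return gains
```

The gains are strictly decreasing (diminishing returns), and their sum matches the closed-form cumulative information exactly, since the product $\prod_t (1+v_{t-1}/\sigma^2)$ telescopes.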
\subsection{Binary Search (Constant Returns)}
\textbf{Setup:} Hidden number $x\in\{1,\ldots,N\}$ with uniform prior. Protocol $C_t$: ask ``Is $x\le m_t$?'' where $m_t$ is the median of remaining possibilities.
\textbf{Observation:} $Y_t\in\{\text{yes},\text{no}\}$ with $p(\text{yes}\mid x)=\1{x\le m_t}$.
\textbf{Information gain per step:}
Because the response is deterministic given $X$, $H(Y_t\mid X)=0$ and
\[
\Delta I_t = I(X;Y_t\mid Y_{1:t-1}) = H(Y_t\mid Y_{1:t-1})
= \Hb\!\big(\tfrac{1}{2}\big) = 1~\text{bit}
\]
\emph{when} the query splits the remaining posterior mass exactly in half (e.g., powers of two under a uniform prior). In general, $\Delta I_t = \Hb(\pi_t)$ with $\pi_t$ the posterior mass on the ``yes'' branch, so $\Delta I_t \le 1$ with equality iff $\pi_t=\tfrac{1}{2}$.
\textbf{Cumulative information:}
\[
I_T \le T \text{ bits},
\]
with equality when every split is exact, in which case $N/2^T$ candidates remain after $T$ queries.
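A short simulation (ours, illustrative only) confirms the constant-returns behavior: each median query on $\{1,\dots,N\}$ with $N$ a power of two contributes $\Hb(\tfrac12)=1$ bit, and the gains sum to $\log_2 N$:

```python
import math

def binary_search_info(N, target):
    """Simulate median queries 'Is x <= m?' on {1..N}; return the per-step
    gains Hb(pi_t) in bits, where pi_t is the mass on the 'yes' branch."""
    lo, hi = 1, N
    gains = []
    while lo < hi:
        m = (lo + hi) // 2                       # median of the remaining range
        pi = (m - lo + 1) / (hi - lo + 1)        # posterior mass on 'yes'
        gains.append(-pi * math.log2(pi) - (1 - pi) * math.log2(1 - pi))
        if target <= m:
            hi = m
        else:
            lo = m + 1
    return gains
```

For $N$ not a power of two some splits are uneven and the corresponding $\Hb(\pi_t)$ dips below one bit, exactly as the text predicts.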
\subsection{Protocol Choice with Uneven Prior (Active Binary Query)}
Let $X$ be an ordered random variable with posterior $p_{t-1}$ at step $t$. Consider protocol $C_t(c)$ that asks ``Is $X\le c$?''. Define the cumulative mass:
\[
\pi(c) \equiv \mathbb{P}(X\le c\mid Y_{1:t-1}) = \sum_{x\le c} p_{t-1}(x).
\]
Because the answer is deterministic given $X$, we have $H(Y_t\mid X,Y_{1:t-1})=0$, so
\begin{equation}\label{eq:binaryentropy}
\Delta I_t(c) = I(X;Y_t\mid Y_{1:t-1}) = H(Y_t\mid Y_{1:t-1}) = \Hb\!\big(\pi(c)\big),
\end{equation}
where $\Hb(p)=-p\log_2 p-(1-p)\log_2(1-p)$ is binary entropy. This is maximized at $\pi(c)=\tfrac{1}{2}$, i.e., \emph{choose $c$ as the posterior median}. Thus the most informative yes/no question ``splits current belief mass in half,'' generalizing bisection to non-uniform priors.
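The median rule \eqref{eq:binaryentropy} is easy to implement by scanning cumulative mass; the sketch below (helper names ours) picks the cutoff $c$ maximizing $\Hb(\pi(c))$ over a discrete posterior:

```python
import math

def Hb(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def best_threshold(posterior):
    """Choose the cutoff c maximizing Hb(P(X <= c)): the posterior median.

    posterior: list of (value, prob) pairs, sorted by value.
    """
    best_c, best_gain, cum = None, -1.0, 0.0
    for x, p in posterior[:-1]:      # cutting after the last value is uninformative
        cum += p
        g = Hb(cum)
        if g > best_gain:
            best_c, best_gain = x, g
    return best_c, best_gain
```

On a skewed posterior the chosen cutoff shifts toward the heavy tail, recovering ordinary bisection only when the prior is uniform.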
\subsection{Soft Constraints via Bernoulli Observations (Beta--Bernoulli)}
Let $\theta\in(0,1)$ be an unknown bias with prior $\mathrm{Beta}(\alpha_{t-1},\beta_{t-1})$ at step $t{-}1$. Protocol $C_t$ observes $Y_t\sim\mathrm{Bernoulli}(\theta)$, corresponding to soft weights
\[
w_{C_t}(\theta) = \theta^{y_t}(1-\theta)^{1-y_t}.
\]
Posteriors remain Beta by conjugacy:
\begin{align}
\theta\mid Y_{1:t-1} &\sim \mathrm{Beta}(\alpha_{t-1},\beta_{t-1}), \\
\theta\mid Y_{1:t-1},Y_t{=}1 &\sim \mathrm{Beta}(\alpha_{t-1}{+}1,\beta_{t-1}), \\
\theta\mid Y_{1:t-1},Y_t{=}0 &\sim \mathrm{Beta}(\alpha_{t-1},\beta_{t-1}{+}1).
\end{align}
The information increment is
\begin{equation}\label{eq:betainfo}
\Delta I_t = h_2(\theta\mid Y_{1:t-1}) - \mathbb{E}_{Y_t\mid Y_{1:t-1}}[h_2(\theta\mid Y_{1:t})],
\end{equation}
where $h_2(\cdot)$ denotes differential entropy in \emph{bits}. For $\mathrm{Beta}(\alpha,\beta)$, define the base-2 digamma $\psi_2(x)\equiv \psi(x)/\ln 2$ and write
\[
h_2(\alpha,\beta) = \log_2 B(\alpha,\beta) - (\alpha-1)\psi_2(\alpha) - (\beta-1)\psi_2(\beta) + (\alpha+\beta-2)\psi_2(\alpha+\beta),
\]
with $B(\alpha,\beta)=\Gamma(\alpha)\Gamma(\beta)/\Gamma(\alpha+\beta)$. Thus the increment in \emph{bits} is
\begin{multline}\label{eq:betabits}
\Delta I_t =
h_2(\alpha_{t-1},\beta_{t-1})
- \frac{\alpha_{t-1}}{\alpha_{t-1}+\beta_{t-1}}\,h_2(\alpha_{t-1}{+}1,\beta_{t-1}) \\
- \frac{\beta_{t-1}}{\alpha_{t-1}+\beta_{t-1}}\,h_2(\alpha_{t-1},\beta_{t-1}{+}1).
\end{multline}
As $\alpha_{t-1}+\beta_{t-1}$ grows, posterior variance shrinks and $\Delta I_t\to 0$ (diminishing returns), exhibiting the soft-constraint analog of Gaussian estimation.
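The increment \eqref{eq:betabits} can be checked numerically without special functions: rather than the digamma closed form above, the sketch below (stdlib-only, function names ours) evaluates the Beta differential entropy by midpoint integration, trading exactness for self-containedness; it assumes $\alpha,\beta \ge 1$ so the density is bounded:

```python
import math

def beta_entropy_bits(a, b, n=20000):
    """Differential entropy (bits) of Beta(a, b) via midpoint integration.
    Assumes a, b >= 1 so the density is bounded on (0, 1)."""
    logB = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    h = 0.0
    for i in range(n):
        x = (i + 0.5) / n
        logp = (a - 1) * math.log(x) + (b - 1) * math.log(1 - x) - logB
        h -= math.exp(logp) * logp / n
    return h / math.log(2)            # nats -> bits

def beta_bernoulli_gain(a, b):
    """dI_t in bits: prior entropy minus predictive-weighted posterior entropies."""
    p1 = a / (a + b)                  # predictive P(Y_t = 1)
    return (beta_entropy_bits(a, b)
            - p1 * beta_entropy_bits(a + 1, b)
            - (1 - p1) * beta_entropy_bits(a, b + 1))
```

For the flat prior $\mathrm{Beta}(1,1)$ the increment is $1 - \tfrac{1}{2\ln 2} \approx 0.279$ bits, and the gain shrinks as the pseudocounts grow, matching the diminishing-returns claim.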
\section{Asymptotics and Predictive Information}\label{sec:asymptotics}
For stationary processes under fixed protocols, define the \textbf{information rate}:
\begin{equation}\label{eq:inforate}
r = \lim_{T\to\infty}\frac{1}{T}I(X_{1:T};Y_{1:T}).
\end{equation}
The \textbf{predictive information}~\cite{bialek2001} quantifies how much the infinite past reveals about the infinite future:
\begin{equation}\label{eq:predinfo}
I_{\mathrm{pred}} = I(Y_{-\infty:0}; Y_{1:\infty}).
\end{equation}
Both are limiting cases of the recursive framework \eqref{eq:framework}.
\section{Summary and Open Questions}\label{sec:summary}
We have presented an operational framework interpreting the chain rule of mutual information as sequential learning under observation protocols. The core insights are:
\begin{enumerate}[nosep]
\item \textbf{Additivity and monotonicity}: Information decomposes stepwise and never decreases.
\item \textbf{Filtration growth vs.\ posterior contraction}: Distinctions expand while plausible states shrink.
\item \textbf{Unification (formal and interpretive)}: Bayesian updating, active learning, rate--distortion, functional information, and variational free energy can be framed under this lens; some connections are formal, while others (e.g., free energy) are closely related but not strict special cases.
\end{enumerate}
\subsection{Sharpened Open Questions}
\begin{itemize}[nosep]
\item \textbf{Information--cost frontiers under physical budgets.} Suppose each protocol has energy/time cost $c(C)$ and a budget $\sum_t c(C_t)\le B$. Characterize upper bounds on $\sum_t \Delta I_t$ as a function of $B$ and channel families, extending Landauer's single-bit bound~\cite{landauer1961}. Connect to information thermodynamics~\cite{sagawa2012} and identify conditions for tightness.
\item \textbf{Adversarial protocol games.} Define a minimax value
\begin{equation}\label{eq:minimax}
V_T = \sup_{\pi(C_t\mid Y_{1:t-1})} \inf_{p(y\mid x,C)\in\mathcal{E}} \sum_{t=1}^T \Delta I_t,
\end{equation}
where $\pi$ is the learner's protocol policy and $\mathcal{E}$ is an uncertainty set over channels. Relate $V_T$ to regret bounds in adversarial bandits~\cite{auer2002} and information-directed sampling trade-offs~\cite{russo2018}.
\item \textbf{Non-stationary latent processes.} Define \emph{bit-regret} $R_T=\sum_t (I_t^\star-\Delta I_t)$ relative to an oracle protocol sequence $\{C_t^\star\}$. Give conditions under which $R_T=O(\log T)$ or $O(\sqrt{T})$, and connect to online learning lower bounds.
\end{itemize}
\section*{Acknowledgments}
The author thanks [names] for helpful discussions.
\begin{thebibliography}{99}
\bibitem{cover}
T.~M. Cover and J.~A. Thomas,
\emph{Elements of Information Theory}, 2nd ed.
Wiley-Interscience, 2006.
\bibitem{lindley1956}
D.~V. Lindley,
``On a measure of the information provided by an experiment,''
\emph{Annals of Mathematical Statistics}, vol.~27, no.~4, pp.~986--1005, 1956.
\bibitem{shannon1959}
C.~E. Shannon,
``Coding theorems for a discrete source with a fidelity criterion,''
\emph{IRE National Convention Record}, Part 4, pp.~142--163, 1959.
\bibitem{mackay1992}
D.~J.~C. MacKay,
``Information-based objective functions for active data selection,''
\emph{Neural Computation}, vol.~4, no.~4, pp.~590--604, 1992.
\bibitem{friston2010}
K.~Friston,
``The free-energy principle: a unified brain theory?,''
\emph{Nature Reviews Neuroscience}, vol.~11, no.~2, pp.~127--138, 2010.
\bibitem{bialek2001}
W.~Bialek, I.~Nemenman, and N.~Tishby,
``Predictability, complexity, and learning,''
\emph{Neural Computation}, vol.~13, no.~11, pp.~2409--2463, 2001.
\bibitem{walker2016}
S.~I. Walker, H.~Kim, and P.~C.~W. Davies,
``The informational architecture of the cell,''
\emph{Philosophical Transactions of the Royal Society A}, vol.~374, no.~2063, 2016.
\bibitem{hazen2007}
R.~M. Hazen, P.~L. Griffin, J.~M. Carothers, and J.~W. Szostak,
``Functional information and the emergence of biocomplexity,''
\emph{Proceedings of the National Academy of Sciences}, vol.~104, no.~23, pp.~8574--8581, 2007.
\bibitem{landauer1961}
R.~Landauer,
``Irreversibility and heat generation in the computing process,''
\emph{IBM Journal of Research and Development}, vol.~5, no.~3, pp.~183--191, 1961.
\bibitem{sagawa2012}
T.~Sagawa and M.~Ueda,
``Nonequilibrium thermodynamics of feedback control,''
\emph{Physical Review E}, vol.~85, 021104, 2012.
\bibitem{auer2002}
P.~Auer, N.~Cesa-Bianchi, Y.~Freund, and R.~E. Schapire,
``The nonstochastic multiarmed bandit problem,''
\emph{SIAM Journal on Computing}, vol.~32, no.~1, pp.~48--77, 2002.
\bibitem{russo2018}
D.~Russo and B.~Van Roy,
``Learning to optimize via information-directed sampling,''
\emph{Operations Research}, vol.~66, no.~1, pp.~230--252, 2018.
\end{thebibliography}
\end{document}