PIAMI — Physics-Informed Ambitious Mechanistic Interpretability

A Comparative History of Mech Interp

Mechanistic Interpretability Is Ready for Its Thomson Moment

Instead of characterizing mechanistic interpretability in terms of its paradigmatic status, we argue that it is in the middle of a normal and healthy cycle of epistemic iteration. A stark parallel exists with thermometry in the 1840s, just before Thomson grounded temperature in a theory that could make falsifiable predictions and guide measurement. Today's mech interp has instruments — SAEs, probes, activation patching — that work well enough for engineering purposes, and anomalies the current theoretical framework cannot explain. Viewing these as targets for a principled theoretical foundation grounded in statistical physics, we claim that mechanistic interpretability is ready for its Thomson moment. The two histories below make this parallel precise.

Part I

A History of Mechanistic Interpretability

Part II

A History of Thermometry

First Wave · 2012–2021

Neurons as Variables

Mechanistic interpretability was born as a project of reverse engineering an algorithm from activations. Against the view that networks were just bags of heuristics, early researchers optimized inputs to maximally activate individual neurons and found genuine interpretable structure. Early-layer neurons responded to curves and edges that composed across depth into progressively more complex descriptive features.

Anomaly: Some neurons fired for multiple distinct features (e.g., cats and cars) at once. This implied that networks represented more variables than they had neurons, upending the idea that neurons corresponded to monosemantic atomic units.

First Instruments · c.1600–1780s

Building Before Understanding

For two centuries thermometry was an empirical craft. Instruments like Galileo's thermoscope or Fahrenheit's mercury thermometer were engineered based on intuition and pragmatic convention. Measurement scales such as Fahrenheit and Celsius were calibrated to reproducible but arbitrary reference points like the freezing and boiling of water. Without a theory of temperature to explain how the instruments worked, there was no way to determine which scale should be preferred or what it meant, physically, to say that something was "hotter".

Parallel: Both fields built functional probes that preceded theoretical understanding.

Second Wave · 2022–2025

Superposition & the Engineering Turn

The anomaly of polysemanticity gave rise to the Superposition Hypothesis: a dense network is a noisy simulation of a much larger sparse model. This inspired the use of Sparse Autoencoders (SAEs), a tool from the compressed sensing community in theoretical computer science, to recover largely monosemantic features. This move cast interpretability as an engineering problem; the field became an engineering culture that hill-climbed on benchmarks through slight architectural adjustments to SAEs.

Anomaly: Feature splitting and absorption, plus non-linear, geometric, and distributed representations, all fell outside the strong superposition hypothesis (for example, see these posts).

The Impasse · 1840s

Regnault's Problem

Regnault showed that different gas thermometers gave small, systematic, and reproducible disagreements on the same physical phenomenon. Regnault's problem showed that the empirical path had hit a wall: to determine which thermometer gave the correct reading, you needed a principled temperature standard.

Parallel: How do you theoretically define an observable (like 'temperature' or 'feature') without measuring it, or measure it without a definition?

Third Wave · 2025–Now

Pessimism, Pragmatism, and the Fork

A growing faction has updated against the ambitious project of complete decompilation, pivoting to methods with immediate empirical payoffs.

The pragmatist position is not without merit: focusing on production-ready tools serves short AI timelines, and it is worth asking whether the ambitious target is a time-consuming distraction rather than a potential impact amplifier. However, PIAMI holds that the pragmatists have overcorrected. Rather than dooming the project, the insufficiency of current tools reflects a missing theoretical foundation, the kind of understanding that has historically narrowed theory-practice gaps like this one. The problems the field faces are not engineering problems but scientific ones.

Anomaly: These are numerous, including how to define and operationalize a feature, how to quantify the uncertainty of an interpretation, and how to design tools based on non-sparse structures like compositionality and hierarchy.

The Thomson Moment · 1848

Grounding Temperature in Theory

Thomson resolved the impasse by reconceptualizing the problem rather than collecting more data: he defined temperature via Carnot efficiency. This turned the unanswerable question ("which gas is correct?") into a tractable one ("which gas is most efficient?"), anchoring the measurement of temperature to fundamental energy limits rather than the behaviors of arbitrary materials. This close interplay between theoretical development and instrument design led not only to more reliable thermometers, but also to the absolute temperature scale that now bears Thomson's later title, Kelvin, and to the formal science of thermodynamics as we know it today.

Parallel: For AI interpretability to succeed, theory and practice need to take each other seriously. Following the logic of temperature's history, we advocate for theoretical foundations that are developed with measurement in mind.

Roadmap

A Physics-Informed Science of AI Safety is Important and Possible

The feedback loop between theory and empirics has driven physics forward for centuries. The call for a similar approach to AI safety is getting stronger, and is being taken up by organizations like Timaeus (now Resolution), Simplex, XOR Labs, ARC, Principia, and Stormglass. Several of the research threads from these groups depend, implicitly or explicitly, on foundations in statistical physics.

The physics community also offers a deep well of knowledge, with a growing number of academic researchers working on understanding deep learning through the lens of effective field theory and renormalization. Some of these threads, pulled from various aspects of deep learning theory, have been woven into the learning mechanics agenda.

We are interested in developing theoretical frameworks that make increasingly accurate predictions about models trained on natural data, including their internal representations and inference algorithms. This requires understanding how the structure of data interplays with learning algorithms and learned representations, then translating that understanding into practical safety tools.

This roadmap is a living document. It will be revised as the field and our own thinking evolve.

Guiding Question

Can we construct idealized models rich enough to capture intrinsic properties of data structure (hierarchical, compositional, sparse, sequential) while remaining tractable enough to make quantitative predictions? What observables do these models share with natural data, and how do they constrain which theories of data structure are actually useful?

Scaling Laws

Can we derive analytic scaling exponents directly from the statistics of natural data, without synthetic data models?

Scaling laws have been a key observable for physics-inspired approaches. Cagnetta 2026 quantitatively predicts the exponents of data-limited scaling laws from two quantities: how pairwise token correlations decay with the distance between tokens, and how the next-token conditional entropy decays with the length of the conditioning context.
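As a rough illustration of the second quantity, the sketch below computes a plug-in estimate of next-token conditional entropy as a function of context length. The function name `conditional_entropy`, the toy sequence, and the counting estimator are assumptions made for illustration, not the procedure used in the cited work; on real corpora, plug-in estimates become unreliable at long contexts and would need correction.

```python
from collections import Counter
from math import log2


def conditional_entropy(tokens, k, max_k):
    """Plug-in estimate of H(x_t | previous k tokens), in bits.

    Positions start at max_k so that every context length is scored on
    the same set of target tokens, which keeps the estimates comparable.
    """
    joint = Counter()    # counts of (context, next token)
    context = Counter()  # counts of the context alone
    for t in range(max_k, len(tokens)):
        ctx = tuple(tokens[t - k:t])
        joint[(ctx, tokens[t])] += 1
        context[ctx] += 1
    total = sum(context.values())
    return -sum(
        (n / total) * log2(n / context[ctx])
        for (ctx, _), n in joint.items()
    )


# Toy sequence (hypothetical): the pattern "aab" repeated, so two tokens
# of context determine the next token exactly.
tokens = list("aab" * 50)
entropies = [conditional_entropy(tokens, k, max_k=2) for k in range(3)]
```

Here `entropies[k]` falls as `k` grows and reaches zero at `k = 2`, because the toy sequence is fully determined by its two preceding tokens. The rate of that decay on natural text is the kind of observable the paragraph above refers to.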

Coppola et al. 2025 take this physics analogy further, developing a renormalization group (RG) framework for learning curves of weakly non-linear networks trained on power-law distributed data. By treating training as an RG flow that progressively integrates high-frequency modes of the data, they analyze the self-similarity and universality of scaling laws. A notable finding is that features typically neglected in standard treatments, such as the discreteness of the data spectrum and lack of translation invariance, lead to both quantitative and qualitative departures from conventional perturbative RG predictions.

Hierarchical & Compositional Data

How does hierarchical, compositional structure in data govern token correlations, sample complexity, and the emergence of multi-scale representations?

Brill 2024 introduces a percolation model of natural data to study scaling laws, showing that hierarchical, sparse structure gives rise to power-law scaling regimes. Brill 2025 extends this to representation learning, using the same random lattice model to categorize learned features by their compositional role of context, component, and surface.

In modelling language specifically, Cagnetta 2024a and 2024b introduce Probabilistic Context-Free Grammars (PCFGs) and use them to study how latent hierarchical tree-like structure in grammar governs token-to-token correlations, and how the effective range of these correlations is governed by training set size. Wyart 2025 uses diffusion models to probe latent hierarchical structure in natural data.
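To make the PCFG setting concrete, here is a minimal sketch of sampling a token sequence from a toy grammar. The grammar, its symbols, and its probabilities are hypothetical and far smaller than the grammars studied in the cited papers; the point is only to show how a latent tree produces correlated surface tokens.

```python
import random

# Hypothetical toy grammar: nonterminal -> list of (probability, expansion).
# Any symbol without an entry is treated as a terminal token.
GRAMMAR = {
    "S": [(0.7, ["A", "B"]), (0.3, ["B", "A"])],
    "A": [(0.5, ["a", "b"]), (0.5, ["b", "a"])],
    "B": [(0.5, ["c", "d"]), (0.5, ["d", "c"])],
}


def sample(symbol, rng):
    """Recursively expand `symbol` and return the list of terminal tokens."""
    if symbol not in GRAMMAR:
        return [symbol]
    probs, expansions = zip(*GRAMMAR[symbol])
    chosen = rng.choices(expansions, weights=probs, k=1)[0]
    return [tok for child in chosen for tok in sample(child, rng)]


rng = random.Random(0)
sentence = sample("S", rng)
```

Tokens that share a parent (for example `a` and `b` under `A`) are always adjacent, while the order of the two halves depends on the root rule. That dependence of token correlations on tree depth is what makes such grammars useful as idealized data models.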

Sequential Data

What is the right formalism for studying minimal sufficient statistics in sequential prediction, and how does it extend to reinforcement learning agents?

Sequential data is modeled in computational mechanics (Shalizi and Crutchfield 1999), which studies minimal sufficient statistics (MSS) for optimal generation and prediction in a hidden Markov model setting. More recently, Rosas et al. 2025 extend this to track MSS for input-output processes, giving a natural formalism for studying (PO)MDPs, the classical setting of reinforcement learning agents.

Universality of Representations

Under what conditions do different learning systems converge to the same latent representation of a phenomenon, and how robust is this convergence to approximation?

Wentworth 2025 explores this in a Bayesian setting, showing that if two agents agree on predictions over a pair of observables but use different latent variables, any two such "natural latents" are isomorphic — with results robust to approximation. The conditions are that the latent must mediate between the observables and be recoverable from either one individually.

Eisenstat 2025 generalises this to the case where an agent's world-model consists of a structured family of latent variables. Under a condition called perfect condensation, any two such latent families over the same observables must correspond, with an approximate version of the correspondence theorem holding more generally via an information-theoretic inequality.

Guiding Question

How do different training phenomena (hierarchical learning, phase transitions, multi-scale dynamics) interact with data structure to shape internal representations? When can the noise generated by training be treated as a perturbative correction, and how does this interact with noise from initialization or finite sample size?

Statics · NNFT

What is the right parameterization of the tilt space for transformer-like architectures?

NNFT uses mean-field theory, treating the GP component as a higher-order correction to an adaptive background. The mean-field background can encode arbitrary correlations between layers and neurons, which gets around the expressivity limits of NNGP++, and Vaintrob et al. 2026 show that this mean-field description is complexity complete. This suggests viewing "circuits" as shifts in the Bayesian posterior's kernel structure rather than fixed computational structures.

The framework has genuine limitations. The space of possible distributions is a priori infinite-dimensional. Restricting to "tilts" (exponential families) improves tractability, reducing the saddle-point problem to finding a minimizer on a finite-dimensional space — but identifying the right parameterization of this tilt space for transformer-like architectures remains an open problem.

Statics · NNFT

How do we develop a unified theory that handles both the bulk (described by MFT) and instanton-like specialized minority neurons?

The mean-field description assumes neurons are statistically exchangeable — each drawn i.i.d. from a shared prior. This breaks down when a small number of neurons are highly specialized, playing qualitatively distinct roles that cannot be captured by any single distribution over rows. In physics language, these are instantons: non-perturbative, localized configurations outside the saddle-point + Gaussian fluctuation picture entirely.

Developing a unified theory that handles both the bulk and instanton-like specialized minority is a key future direction for interpretability. Such cases may be better described by Saxe-inspired analyses of linear network dynamics, where specialization emerges through singular value structure.

Statics · NNFT

What self-consistent distribution over neurons has the network converged to, and what computations does that distribution implement?

If NNFT is the right asymptotic description of the structures we care about, it gives us the right objects to reason about. Instead of asking why specific weights have specific values, a question with no clean answer, we ask what self-consistent distribution over neurons the network has converged to, and what computations that distribution implements.

This is a better-posed question, and one that makes contact with the circuit-level descriptions that interpretability already uses, but grounds them in a principled theory of why those circuits form, when they form, and how they scale.

Statics · SLT

Do MFT phase transitions and RLCT transitions track the same underlying structure? Can the RLCT be computed or bounded by MFT analysis?

Singular Learning Theory (SLT) paints an idealized picture of model behavior controlled by local geometric structure. The Real Log Canonical Threshold (RLCT), also called the learning coefficient, governs the geometry of the posterior near a local minimum; MFT describes the structure of that posterior as a saddle-point approximation.
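For reference, the role of the RLCT is usually stated through Watanabe's asymptotic expansion of the Bayesian free energy. The notation below is introduced here for illustration and does not come from this roadmap: F_n is the free energy on n samples, L_n the empirical negative log-likelihood per sample, w_0 an optimal parameter, \varphi the prior, \lambda the RLCT, and m its multiplicity.

```latex
F_n = -\log \int \exp\bigl(-n L_n(w)\bigr)\,\varphi(w)\,dw
    = n L_n(w_0) + \lambda \log n - (m - 1)\log\log n + O_p(1)
```

For a regular model with d parameters, \lambda = d/2 and m = 1, which recovers the familiar BIC penalty; in singular models \lambda can be smaller, and this is the sense in which it measures local geometric complexity.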

Whether these descriptions are compatible, and whether the RLCT can be computed or bounded by MFT analysis, is an important open direction. Resolving this would strengthen the theoretical case for why SLT observables track circuit structure.

Progress here depends on observables that can actually be measured at scale. Gordon et al. 2026 address how susceptibilities scale as diagnostics of internal structure, while Hennick et al. 2026 construct spectral early-warning signals from reduced density matrices, detecting critical slowing down before a transition completes.

Dynamics

Under what conditions does the stochastic (DMFT) or deterministic (saddle-to-saddle) picture of learning dynamics provide a better theoretical explanation?

Theories of learning dynamics treat noise in two ways. The deterministic picture (Saxe et al. 2014, Abbe et al. 2023) studies discrete symmetry-breaking events — saddle-to-saddle transitions where different singular modes of the target function appear in sequence. The stochastic picture (Bordelon & Pehlevan 2022, 2026) uses DMFT to formulate the kernel as a dynamical object during training.

The circumstances under which one perspective provides a better theoretical explanation than the other remain open. Incorporating finite learning rate as a thermodynamic control parameter into DMFT frameworks, where it would enter alongside width, depth, and initialization scale as a regulator of the feature-learning regime, is also an important open direction.
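The deterministic picture can be illustrated with the simplest case analyzed in the deep linear network literature: a single singular mode of strength s, fit by the product of two scalar weights starting from a small initialization. The sketch below is a toy under stated assumptions (scalar weights, balanced initialization, plain gradient descent on squared error); the function name `steps_to_learn` and all hyperparameters are hypothetical.

```python
def steps_to_learn(s, a0=1e-3, lr=1e-2, frac=0.5, max_steps=1_000_000):
    """Gradient-descent steps until the product u*v reaches frac * s.

    Loss is 0.5 * (s - u*v)**2, with a balanced small initialization
    u = v = sqrt(a0). Returns max_steps if the threshold is never reached.
    """
    u = v = a0 ** 0.5
    for step in range(max_steps):
        if u * v >= frac * s:
            return step
        err = s - u * v
        # Simultaneous update of both weights along the negative gradient.
        u, v = u + lr * v * err, v + lr * u * err
    return max_steps


strong = steps_to_learn(3.0)  # mode with larger singular value
weak = steps_to_learn(1.0)    # mode with smaller singular value
```

The stronger mode escapes the neighborhood of the saddle at the origin earlier than the weaker one, so modes are picked up in sequence rather than all at once. The stochastic picture adds noise and width-dependent kernel dynamics on top of this skeleton, which the sketch does not attempt to model.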

Post-Training

Should model-free RL be conceptualized as eliciting latent capabilities or discovering novel ones? How does this differ across fine-tuning methods?

On the elicitation end, Yue 2025 shows that base models achieve higher pass@k at large k on math, reasoning, and coding benchmarks than their RLVR (reinforcement learning with verifiable rewards) fine-tuned counterparts, suggesting that the reasoning abilities of RLVR-trained models originate from and are bounded by the base model. On the discovery end, Bush et al. 2025 provide mechanistic evidence that model-free RL algorithms can learn policies resembling System 2 reasoning.
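For readers unfamiliar with the metric, pass@k is the probability that at least one of k sampled solutions is correct. A commonly used unbiased estimator computes it from n samples per problem, of which c are correct. The sketch below implements that estimator; whether the cited work uses exactly this form is an assumption.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased estimate of pass@k from n samples with c correct.

    Equals 1 - C(n - c, k) / C(n, k): one minus the probability that a
    random size-k subset of the n samples contains no correct solution.
    """
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


low_k = pass_at_k(n=10, c=1, k=1)
high_k = pass_at_k(n=10, c=1, k=10)
```

A model with a low per-sample success rate can still score well at large k, which is why comparisons at large k probe what a model can do at all rather than what it does reliably.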

Further investigation into the learnability of different policies and the characterization of their safety profiles is warranted. Hazan et al. 2025 propose a research program investigating learnability of stochastic processes by treating them as dynamical systems and studying their stability, mixing, observability, and spectral properties.

Guiding Question

How can we classify the different kinds of representations, circuits, and algorithms learned in neural networks? What does each framework treat as its fundamental object and implicit complexity measure? How do theories of learning and representations inform interpretability tools?

Geometry

Can the "data kernel matrix" be linked to atomic, interpretable features beyond PCA components?

Early work in NTK and Gaussian process models created a paradigm of feature learning as a physically-aligned model of representations, classifying inputs by their principal components in activation space. While valuable for lazy learning and NNGP++ paradigms, this approach works poorly for interpreting semantic features or mechanisms in sophisticated models.

A richer picture treats activation data as a kernel matrix (the matrix of activation dot products in the data basis) rather than raw PCA features. In some sense this is a vacuously general object: any two neural nets with the same activation dot products are equivalent from the point of view of training and Bayesian learning. Linking the data kernel matrix to more atomic and interpretable features beyond PCA has not yet been done in a satisfying way. Vaintrob 2026 approaches the same gap from sparsity and frustration, asking what statistical field theory adds to the sparse-features picture. Lin 2025 makes progress on exactly this, showing that eigenanalysis of the empirical NTK surfaces ground-truth features in toy models of superposition and recovers Fourier feature families in modular arithmetic. A layerwise eNTK localizes features to specific layers, and the evolution of its spectrum flags the grokking transition — a feature-identification method that doubles as an observable.
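The sense in which the data kernel matrix is basis-independent can be checked directly: rotating the neuron basis changes every activation vector but leaves the matrix of dot products between inputs unchanged. The sketch below uses three hypothetical two-dimensional activation vectors and a plane rotation as the orthogonal map; the names `gram` and `rotate` are illustrative.

```python
from math import cos, sin


def gram(acts):
    """Data kernel: K[a][b] = dot(acts[a], acts[b]) over inputs a, b."""
    return [[sum(x * y for x, y in zip(u, v)) for v in acts] for u in acts]


def rotate(acts, theta):
    """Rotate every 2-d activation vector by theta (an orthogonal map)."""
    c, s = cos(theta), sin(theta)
    return [[c * x - s * y, s * x + c * y] for x, y in acts]


# Hypothetical activations: one row per input, one column per neuron.
acts = [[1.0, 0.0], [0.5, 2.0], [-1.0, 1.0]]
K = gram(acts)
K_rot = gram(rotate(acts, 0.7))
max_diff = max(
    abs(a - b) for row, row_rot in zip(K, K_rot) for a, b in zip(row, row_rot)
)
```

`max_diff` is zero up to floating-point error, so the kernel cannot by itself single out a privileged neuron or feature basis. That is the gap between the kernel and "atomic" features described above.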

Circuits · Belief States

How do the fractal and multiscale properties of belief state representations connect to multiscale noise and structure in realistic models?

Transformers trained on text generated by hidden Markov models learn to linearly represent optimal belief states, and the representations and mechanisms are understood in quite a bit of detail in simple cases (Piotrowski et al. 2025). Any text generation process can be approximated arbitrarily well with a finite HMM, making this a clean playground for understanding representations with hidden context variables.

While the optimal belief state propagation is deterministic, the emergent algorithm has interesting fractal and multiscale properties which may be linked to multiscale structure and noise in realistic models. This connection has not yet been fleshed out.
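The deterministic belief state propagation mentioned above is ordinary Bayesian filtering on the hidden state. A minimal sketch follows, assuming a convention in which the state transitions first and the new state then emits the observation; the two-state matrices `T` and `E` are hypothetical.

```python
def update_belief(belief, obs, T, E):
    """One step of Bayesian filtering for a hidden Markov model.

    T[i][j] = P(next state j | state i)
    E[j][o] = P(observation o | state j)
    """
    n = len(belief)
    # Predict: push the current belief through the transition matrix.
    predicted = [sum(belief[i] * T[i][j] for i in range(n)) for j in range(n)]
    # Correct: reweight by the likelihood of the observation, then normalize.
    unnorm = [predicted[j] * E[j][obs] for j in range(n)]
    z = sum(unnorm)
    return [u / z for u in unnorm]


def filter_sequence(prior, observations, T, E):
    """Return the list of belief states visited along an observation sequence."""
    beliefs = [prior]
    for obs in observations:
        beliefs.append(update_belief(beliefs[-1], obs, T, E))
    return beliefs


T = [[0.9, 0.1], [0.2, 0.8]]
E = [[0.8, 0.2], [0.3, 0.7]]
beliefs = filter_sequence([0.5, 0.5], [0, 0, 1], T, E)
```

Each observation sequence maps to exactly one point in the probability simplex. The set of points reachable over all sequences is the object whose fractal geometry is referred to above.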

Theoretical Idealization

Are learned computations in the mixed initialization regime characteristically different from those in the pure feature learning regime?

Production models are often trained with initialization lengths that interpolate between lazy and feature learning regimes — the mixed-feature learning regime. It is currently unclear how to interpret the significance of training noise in this regime.

One interpretation holds that the mixed regime is in spirit the same as the feature learning regime, with initialization length only governing the sharpness of the sigmoid accuracy curve. Another interpretation asserts that computation in this regime is qualitatively different, with results emerging from noisy circuits — as seen in mean-field predictions for modular addition (Rubin et al. 2024) and the de-noising task (Vaintrob 2025).

Guiding Question

Can we design architectures and training procedures that produce representations amenable to mechanistic analysis from the outset? What inductive biases most reliably lead to interpretable internal structure, and how do we verify that interpretability is preserved as systems scale?

Architectures

Can tensor network architectures provide structured factorizations that make internal representations geometrically tractable at scale?

Tensor network architectures impose structured factorizations on weight matrices, making internal representations geometrically tractable. Weight-sparse models constrain the effective degrees of freedom, reducing the combinatorial complexity of circuit search. Both approaches attempt to build interpretable structure in from the outset rather than discovering it post hoc.

Mack et al. 2026 pursue a related factorization. The ParityTransformer replaces a learned over-complete basis at each layer with a parameter-free algebraic dictionary, giving a deterministic incoherence guarantee and removing the memory cost that has made per-layer interpretable bottlenecks impractical at GPT-2 scale. Because subsequent computation acts only on features that survive the bottleneck, those features are native to the forward pass rather than recovered post hoc.

Training Procedures

Can gradient routing and objective engineering induce modular, identifiable circuit structure that persists under standard fine-tuning?

Gradient routing techniques selectively direct learning signals to specific subnetworks, encouraging functional specialization. This can induce modular structure not present under standard training, making circuits more identifiable. MELBO and related objective engineering approaches explicitly reward disentanglement, modularity, or sparsity of internal representations — building interpretability into the learning signal itself.

Guiding Question

Can we build unsupervised tools for reconstructing learned algorithms and representations from the weights and activations of deployed models — without requiring labeled ground truth? What theoretical guarantees can we provide about the completeness and faithfulness of such reconstructions?

Activation-Based

What replaces SAEs when sparsity as a proxy for feature identity competes with compositionality and hierarchical structure?

SAEs decompose residual stream activations into sparse linear combinations of learned feature directions. While they have achieved notable successes, their reliance on sparsity as a proxy for feature identity competes with compositionality and hierarchical structure. In other words, SAEs rest on an incomplete data model: one designed to account for sparsity, whose optimization trades off against the other properties we wish to capture.
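To make the objective concrete, the sketch below computes the forward pass and loss of a basic sparse autoencoder on a single activation vector: a ReLU encoder, a linear decoder, and a loss that sums squared reconstruction error with an L1 penalty on the features. This is one common formulation, not a specific published architecture, and the weights and the `l1_coeff` value are hypothetical.

```python
def relu(v):
    return [max(0.0, x) for x in v]


def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def sae_loss(x, W_enc, b_enc, W_dec, b_dec, l1_coeff):
    """Return (features, reconstruction, loss) for one activation vector x."""
    pre = [p + b for p, b in zip(matvec(W_enc, x), b_enc)]
    features = relu(pre)
    recon = [r + b for r, b in zip(matvec(W_dec, features), b_dec)]
    sq_err = sum((a - r) ** 2 for a, r in zip(x, recon))
    l1 = sum(features)  # features are non-negative after the ReLU
    return features, recon, sq_err + l1_coeff * l1


# Hypothetical 2-d activation with a 3-feature overcomplete dictionary.
x = [1.0, -2.0]
W_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
b_enc = [0.0, 0.0, 0.0]
W_dec = [[1.0, 0.0, -1.0], [0.0, 1.0, 0.0]]
b_dec = [0.0, 0.0]
features, recon, loss = sae_loss(x, W_enc, b_enc, W_dec, b_dec, l1_coeff=0.1)
```

The only structural prior in this objective is the L1 term. Nothing in it rewards hierarchy or composition, which is the sense in which the data model is incomplete.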

Mack, Panickssery and Turner 2026 attack the unsupervised side directly. Causal Perturbative Elicitation decomposes a deep transformer slice by tensor decomposition to learn interpretable low-rank adapters from a single example, surfacing hidden failure modes such as sandbagging without any labelled target.

Physics-informed alternatives may address the limitations of SAEs by decomposing the learned weight distribution into its saddle-point component (circuits) and Gaussian fluctuations (noise), rather than decomposing activations naively into sparse dictionaries.

Activation-Based

Can activation manifold geometry provide a more faithful characterization of learned representations than dictionary-based decompositions?

Rather than decomposing activations into fixed dictionaries, manifold-based approaches characterize the geometry of activation spaces directly — identifying low-dimensional structure, curvature, and topological features that reflect the network's learned representations. This approach avoids the sparsity assumption entirely and may be more robust to the kinds of distributed, compositional representations that SAEs struggle with.
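One simple, sparsity-free summary of activation geometry is the participation ratio of the activation covariance, which estimates how many directions carry appreciable variance. It is a linear measure and only a crude stand-in for the curvature and topology mentioned above; choosing it here is an assumption made for illustration, and the data below is hypothetical.

```python
def participation_ratio(acts):
    """Effective dimensionality (tr C)^2 / tr(C^2) of the covariance C.

    acts is a list of activation vectors, one per input. For symmetric C,
    tr(C^2) equals the sum of squared entries, so no eigensolver is needed.
    """
    n, d = len(acts), len(acts[0])
    mean = [sum(row[j] for row in acts) / n for j in range(d)]
    centered = [[row[j] - mean[j] for j in range(d)] for row in acts]
    cov = [
        [sum(r[i] * r[j] for r in centered) / n for j in range(d)]
        for i in range(d)
    ]
    trace = sum(cov[i][i] for i in range(d))
    trace_sq = sum(c * c for row in cov for c in row)
    return trace * trace / trace_sq


# Points on a line embedded in 3-d: one effective dimension.
line = [[t, 2 * t, -t] for t in (-2.0, -1.0, 0.0, 1.0, 2.0)]
# Points spread equally along two axes: two effective dimensions.
cross = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
pr_line = participation_ratio(line)
pr_cross = participation_ratio(cross)
```

The ratio is 1 for the line and 2 for the cross, regardless of how many neurons the points are embedded in.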

Weight-Based

Can direct analysis of weight matrices via tensor decompositions recover functional subnetworks without requiring activation data?

Direct analysis of weight matrices, via SVD, tensor decompositions, or graph-theoretic methods, aims to identify functional subnetworks without requiring activation data at all. This is particularly relevant for understanding circuits that are sparse in weight space rather than activation space, and may be more robust to distribution shift in the deployment setting.
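As a small illustration of weight-only analysis, the sketch below extracts the leading singular triple of a weight matrix by power iteration, using no activation data. It is a dependency-free stand-in for a library SVD routine such as `numpy.linalg.svd`, which one would use in practice; the matrix `W` and the function name `top_singular` are hypothetical.

```python
def top_singular(W, iters=200):
    """Leading singular value and vectors of W via power iteration on W^T W.

    Returns (sigma, u, v) with W v = sigma * u, up to convergence error.
    Assumes W is nonzero and the start vector is not orthogonal to v.
    """
    rows, cols = len(W), len(W[0])
    v = [1.0] * cols
    for _ in range(iters):
        u = [sum(W[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        v = [sum(W[i][j] * u[i] for i in range(rows)) for j in range(cols)]
        norm = sum(x * x for x in v) ** 0.5
        v = [x / norm for x in v]
    u = [sum(W[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    sigma = sum(x * x for x in u) ** 0.5
    return sigma, [x / sigma for x in u], v


# Hypothetical weight matrix with one dominant input-output direction.
W = [[3.0, 0.0], [0.0, 1.0]]
sigma, u, v = top_singular(W)
```

The pair (u, v) identifies the input direction the layer amplifies most and the output direction it writes to. Whether such directions correspond to meaningful circuits is exactly the open question posed above.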
