We believe that ambitious mechanistic interpretability is possible, and we believe physics is the answer. A collaboration at the intersection of statistical physics, deep learning theory, and AI safety.
To understand where mechanistic interpretability stands today, it helps to place it alongside a surprisingly close historical parallel: the development of thermometry in the nineteenth century. Both fields faced the same foundational impasse. Both needed a Thomson.
The case for physics-inspired interpretability rests on two distinct but reinforcing claims: that a scientific theory of deep learning is already emerging from the physics community, and that statistical physics has developed exactly the tools needed to address interpretability's deepest unsolved problems.
We are interested in developing theoretical frameworks that make increasingly accurate predictions about models trained on natural data — including their internal representations and inference algorithms. This requires understanding how the structure of data interplays with learning algorithms and learned representations — and translating that understanding into practical safety tools.
Can we construct idealized models rich enough to capture intrinsic properties of data structure (hierarchical, compositional, sparse, sequential) while remaining tractable enough to make quantitative predictions? What observables do these models share with natural data, and how do they constrain which theories of data structure are actually useful?
Scaling laws have been a key observable for physics-inspired approaches. Cagnetta 2026 quantitatively predicts the exponents of data-limited scaling laws from two quantities: how pairwise token correlations decay with the distance between tokens, and how the next-token conditional entropy decays with the length of the conditioning context.
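As a minimal sketch of the first of these observables, the snippet below estimates how pairwise token correlations decay with distance on a synthetic sequence. The correlation measure (Frobenius norm of the joint distribution minus the product of its marginals), the "sticky" toy chain, and the names `pair_correlation` and `sample_sticky_chain` are illustrative assumptions, not the estimator used in the cited work.

```python
import numpy as np

def pair_correlation(tokens, vocab_size, distance):
    """Frobenius norm of P(x_i, x_{i+d}) - P(x_i) P(x_{i+d}), estimated from counts."""
    left, right = tokens[:-distance], tokens[distance:]
    joint = np.zeros((vocab_size, vocab_size))
    np.add.at(joint, (left, right), 1.0)  # accumulate co-occurrence counts
    joint /= joint.sum()
    independent = np.outer(joint.sum(axis=1), joint.sum(axis=0))
    return np.linalg.norm(joint - independent)

def sample_sticky_chain(length, vocab_size, stay_prob, seed=0):
    """Toy data (an assumption): repeat the previous token with probability stay_prob,
    otherwise draw a fresh uniform token."""
    rng = np.random.default_rng(seed)
    tokens = np.empty(length, dtype=np.int64)
    tokens[0] = rng.integers(vocab_size)
    stay = rng.random(length) < stay_prob
    fresh = rng.integers(vocab_size, size=length)
    for i in range(1, length):
        tokens[i] = tokens[i - 1] if stay[i] else fresh[i]
    return tokens

tokens = sample_sticky_chain(length=50_000, vocab_size=8, stay_prob=0.8)
correlations = {d: pair_correlation(tokens, 8, d) for d in (1, 2, 4, 8)}
for d, value in correlations.items():
    print(f"distance {d}: correlation {value:.4f}")
```

On this toy chain the correlations decay exponentially rather than as a power law; natural text would need a richer generative model to show the slower decay that the scaling-law analysis relies on.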
Coppola et al. 2025 take this physics analogy further, developing a renormalization group (RG) framework for learning curves of weakly non-linear networks trained on power-law distributed data. By treating training as an RG flow that progressively integrates high-frequency modes of the data, they analyze the self-similarity and universality of scaling laws. A notable finding is that features typically neglected in standard treatments — such as the discreteness of the data spectrum and lack of translation invariance — lead to both quantitative and qualitative departures from conventional perturbative RG predictions.
Brill 2024 introduces a percolation model of natural data to study scaling laws, showing that hierarchical, sparse structure gives rise to power-law scaling regimes. Brill 2025 extends this to representation learning, using the same random lattice model to categorize learned features by their compositional role: context, component, or surface.
In modelling language specifically, Cagnetta 2024a and 2024b introduce Probabilistic Context-Free Grammars (PCFGs) and use them to study how latent hierarchical tree-like structure in grammar governs token-to-token correlations, and how the effective range of these correlations is governed by training set size. Wyart 2025 uses diffusion models to probe latent hierarchical structure in natural data.
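To make the PCFG setting concrete, here is a small sketch that samples strings from a toy grammar. The grammar, its probabilities, and the `sample` helper are hypothetical and chosen only for illustration; the cited papers use their own, much larger synthetic grammars.

```python
import random

# Toy grammar (an assumption): each nonterminal maps to a list of
# (probability, right-hand side) production rules.
GRAMMAR = {
    "S": [(1.0, ["NP", "VP"])],
    "NP": [(0.7, ["det", "noun"]), (0.3, ["det", "adj", "noun"])],
    "VP": [(0.6, ["verb", "NP"]), (0.4, ["verb"])],
}

def sample(symbol, rng):
    """Recursively expand a symbol; anything not in GRAMMAR is a terminal token."""
    if symbol not in GRAMMAR:
        return [symbol]
    probs, rules = zip(*GRAMMAR[symbol])
    rhs = rng.choices(rules, weights=probs, k=1)[0]
    return [token for child in rhs for token in sample(child, rng)]

rng = random.Random(0)
sentences = [sample("S", rng) for _ in range(5)]
for sentence in sentences:
    print(" ".join(sentence))
```

Tokens that share a low common ancestor in the derivation tree (for example `det` and `noun` inside one `NP`) are more strongly correlated than tokens joined only at the root, which is the kind of structure the correlation analysis exploits.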
Sequential data is modeled in computational mechanics (Shalizi and Crutchfield 1999), which studies, in a hidden Markov model setting, the minimal sufficient statistics (MSS) for optimal generation and prediction. More recently, Rosas et al. 2025 extend this to track MSS for input-output processes, giving a natural formalism to study (PO)MDPs, the classical setting of reinforcement learning agents.
Wentworth 2025 explores this in a Bayesian setting, showing that if two agents agree on predictions over a pair of observables but use different latent variables, any two such "natural latents" are isomorphic — with results robust to approximation. The conditions are that the latent must mediate between the observables and be recoverable from either one individually.
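One way to write the two conditions symbolically is sketched below. The notation ($X_1, X_2$ for the observables, $\Lambda$ for the latent, $\varepsilon$ for the approximation tolerance) is our own assumption and may differ from the cited work.

```latex
% Mediation: the observables are independent given the latent.
I(X_1 ; X_2 \mid \Lambda) \le \varepsilon
% Redundancy: the latent is recoverable from either observable alone.
H(\Lambda \mid X_1) \le \varepsilon, \qquad H(\Lambda \mid X_2) \le \varepsilon
```

Setting $\varepsilon = 0$ gives the exact conditions; the robustness result says the isomorphism degrades gracefully as $\varepsilon$ grows.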
Eisenstat 2025 generalises this to the case where an agent's world-model consists of a structured family of latent variables. Under a condition called perfect condensation, any two such latent families over the same observables must correspond, with an approximate version of the correspondence theorem holding more generally via an information-theoretic inequality.
How do different training phenomena — hierarchical learning, phase transitions, multi-scale dynamics — interact with data structure to shape internal representations? When can the noise generated by training be treated as a perturbative correction, and how does this interact with noise from initialization or finite sample size?
NNFT uses mean-field theory, treating the GP component as a higher-order correction to an adaptive background. The mean-field background can encode arbitrary correlations between layers and neurons — getting around NNGP++ expressivity limits. This suggests viewing "circuits" as shifts in the Bayesian posterior's kernel structure rather than fixed computational structures.
The framework has genuine limitations. The space of possible distributions is a priori infinite-dimensional. Restricting to "tilts" (exponential families) improves tractability, reducing the saddle-point problem to finding a minimizer on a finite-dimensional space — but identifying the right parameterization of this tilt space for transformer-like architectures remains an open problem.
The mean-field description assumes neurons are statistically exchangeable — each drawn i.i.d. from a shared prior. This breaks down when a small number of neurons are highly specialized, playing qualitatively distinct roles that cannot be captured by any single distribution over rows. In physics language, these are instantons: non-perturbative, localized configurations outside the saddle-point + Gaussian fluctuation picture entirely.
Developing a unified theory that handles both the bulk and instanton-like specialized minority is a key future direction for interpretability. Such cases may be better described by Saxe-inspired analyses of linear network dynamics, where specialization emerges through singular value structure.
If NNFT is the right asymptotic description of the structures we care about, it gives us the right objects to reason about. Instead of asking why specific weights have specific values — a question with no clean answer — we ask what self-consistent distribution over neurons the network has converged to, and what computations that distribution implements.
This is a better-posed question, and one that makes contact with the circuit-level descriptions that interpretability already uses, but grounds them in a principled theory of why those circuits form, when they form, and how they scale.
Singular Learning Theory (SLT) paints an idealized picture of model behavior controlled by local geometric structure. The real log canonical threshold (RLCT), also known as the learning coefficient, governs the geometry of the posterior near a local minimum; mean-field theory (MFT) describes the structure of that posterior as a saddle-point approximation.
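For reference, the role of the RLCT is usually stated through the asymptotic expansion of the Bayesian free energy. The symbols below ($F_n$, $L_n$, $w_0$, $\varphi$, $\lambda$, $m$) are standard SLT notation introduced here, not notation from the text above.

```latex
F_n = -\log \int e^{-n L_n(w)} \, \varphi(w) \, dw
    = n L_n(w_0) + \lambda \log n - (m - 1) \log\log n + O_p(1)
```

Here $L_n$ is the empirical negative log-likelihood over $n$ samples, $w_0$ an optimal parameter, $\varphi$ the prior, $\lambda$ the RLCT, and $m$ its multiplicity. For a regular model with $d$ parameters, $\lambda = d/2$ and $m = 1$, recovering the usual BIC penalty; singular models have $\lambda \le d/2$.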
Whether these descriptions are compatible, and whether the RLCT can be computed or bounded by MFT analysis, is an important open direction. Pursuing this would lend increased theoretical support for why SLT observables track circuit structure.
Theories of learning dynamics treat noise in two ways. The deterministic picture (Saxe et al. 2014, Abbe et al. 2023) studies discrete symmetry-breaking events: saddle-to-saddle transitions in which different singular modes of the target function appear in sequence. The stochastic picture (Bordelon & Pehlevan 2022, 2026) uses dynamical mean-field theory (DMFT) to formulate the kernel as a dynamical object during training.
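The deterministic picture can be illustrated with a direct simulation. The sketch below runs full-batch gradient descent on a two-layer linear network from small random initialization and records when each singular mode of the target map is learned. The target, the learning rate, the whitened-input assumption, and the half-strength criterion are all illustrative choices, not taken from the cited papers.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 4
singular_values = np.array([4.0, 2.0, 1.0, 0.5])
# Target input-output map with known singular modes (whitened inputs assumed).
target = np.diag(singular_values)

# Small random initialization puts the network near the saddle at the origin.
w1 = 1e-3 * rng.standard_normal((dim, dim))
w2 = 1e-3 * rng.standard_normal((dim, dim))
learning_rate = 0.01

history = []
for step in range(5000):
    error = target - w2 @ w1
    # Gradient descent on 0.5 * ||target - w2 @ w1||_F^2.
    step_w1 = w2.T @ error
    step_w2 = error @ w1.T
    w1 += learning_rate * step_w1
    w2 += learning_rate * step_w2
    history.append(np.diag(w2 @ w1).copy())
history = np.array(history)

# First step at which each mode reaches half of its target strength.
half_times = [int(np.argmax(history[:, k] > 0.5 * singular_values[k])) for k in range(dim)]
print(half_times)
```

Modes with larger singular values escape the saddle first, so the recorded times increase from the strongest to the weakest mode, giving the stage-like learning curve characteristic of this picture.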
The circumstances under which one perspective provides a better theoretical explanation than the other remain open. Incorporating finite learning rate as a thermodynamic control parameter into DMFT frameworks, where it would enter alongside width, depth, and initialization scale as a regulator of the feature-learning regime, is also an important open direction.
On the elicitation end, Yue 2025 shows that base models achieve higher pass@k at large k on math, reasoning, and coding benchmarks than their RLVR fine-tuned counterparts, suggesting that the reasoning abilities of the fine-tuned models originate from, and are bounded by, the base model. On the discovery end, Bush et al. 2025 provide mechanistic evidence that model-free RL algorithms can learn policies resembling system 2 reasoning.
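For readers unfamiliar with the metric, pass@k is the probability that at least one of k sampled solutions is correct. A commonly used unbiased estimator, given n samples of which c are correct, is sketched below; whether the cited work uses exactly this estimator is an assumption, and the sample counts in the usage lines are hypothetical.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased estimate of pass@k from n samples with c correct: 1 - C(n-c, k) / C(n, k)."""
    if k > n:
        raise ValueError("k must not exceed the number of samples n")
    # math.comb returns 0 when k > n - c, so the estimate is exactly 1 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical problem: 100 samples, 5 of them correct.
low_k = pass_at_k(n=100, c=5, k=1)
high_k = pass_at_k(n=100, c=5, k=50)
print(low_k, high_k)
```

A model with a low per-sample success rate can still score highly at large k, which is why comparing base and fine-tuned models at large k probes the coverage of the underlying distribution rather than its sharpness.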
Further investigation into the learnability of different policies and the characterization of their safety profiles is warranted. Hazan et al. 2025 propose a research program investigating learnability of stochastic processes by treating them as dynamical systems and studying their stability, mixing, observability, and spectral properties.
How can we classify the different kinds of representations, circuits, and algorithms learned in neural networks? What does each framework treat as its fundamental object and implicit complexity measure? How do theories of learning and representations inform interpretability tools?
Early work in NTK and Gaussian process models created a paradigm of feature learning as a physically-aligned model of representations, classifying inputs by their principal components in activation space. While valuable for lazy learning and NNGP++ paradigms, this approach works poorly for interpreting semantic features or mechanisms in sophisticated models.
A richer picture treats activation data as a kernel matrix — the matrix of activation dot products in the data basis — rather than raw PCA features. In some sense this is a vacuously general object: any two neural nets with the same activation dot products are equivalent from the point of view of training and Bayesian learning. Linking the data kernel matrix to more atomic and interpretable features beyond PCA has not yet been done in a satisfying way.
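The sense in which the kernel matrix is basis-independent can be checked numerically. The sketch below builds a random activation matrix, rotates the neuron basis by a random orthogonal matrix, and confirms that the matrix of activation dot products is unchanged; the sizes and variable names are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_inputs, width = 6, 10

# Rows are data points, columns are neurons.
activations = rng.standard_normal((n_inputs, width))

# Random orthogonal change of neuron basis (Q factor of a QR decomposition).
rotation, _ = np.linalg.qr(rng.standard_normal((width, width)))
rotated = activations @ rotation

# Kernel (Gram) matrix in the data basis: dot products between activation vectors.
kernel = activations @ activations.T
kernel_rotated = rotated @ rotated.T

print(np.allclose(kernel, kernel_rotated))
```

The individual neuron activations change under the rotation while the kernel does not, which is why the kernel alone cannot single out a preferred set of interpretable features.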
Transformers trained on text generated by hidden Markov models learn to linearly represent optimal belief states, and the representations and mechanisms are understood in quite a bit of detail in simple cases (Piotrowski et al. 2025). Any text generation process can be approximated arbitrarily well with a finite HMM, making this a clean playground for understanding representations with hidden context variables.
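The belief state referred to here is the Bayesian posterior over hidden states given the tokens seen so far. A minimal sketch of the update for a two-state HMM follows; the transition and emission matrices and the observed symbol sequence are made-up illustrative values.

```python
import numpy as np

def update_belief(belief, symbol, transition, emission):
    """One step of Bayesian filtering: propagate through the transition
    matrix, reweight by the likelihood of the observed symbol, renormalize."""
    predicted = belief @ transition
    unnormalized = predicted * emission[:, symbol]
    return unnormalized / unnormalized.sum()

# Hypothetical HMM: rows index the current hidden state.
transition = np.array([[0.9, 0.1],
                       [0.2, 0.8]])   # P(next state | current state)
emission = np.array([[0.8, 0.2],
                     [0.3, 0.7]])     # P(symbol | state)

belief = np.array([0.5, 0.5])
trajectory = [belief]
for symbol in [0, 0, 1, 1, 1]:
    belief = update_belief(belief, symbol, transition, emission)
    trajectory.append(belief)

for step, b in enumerate(trajectory):
    print(step, np.round(b, 3))
```

The set of belief vectors reachable over all observation sequences is the geometric object that the trained transformer is reported to represent linearly.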
While the optimal belief state propagation is deterministic, the emergent algorithm has interesting fractal and multiscale properties which may be linked to multiscale structure and noise in realistic models. This connection has not yet been fleshed out.
Production models are often trained with initialization lengths that interpolate between lazy and feature learning regimes — the mixed-feature learning regime. It is currently unclear how to interpret the significance of training noise in this regime.
One interpretation holds that the mixed regime is in spirit the same as the feature learning regime, with initialization length only governing the sharpness of the sigmoid accuracy curve. Another interpretation asserts that computation in this regime is qualitatively different, with results emerging from noisy circuits — as seen in mean-field predictions for modular addition (Rubin et al. 2024) and the de-noising task (Vaintrob 2025).
Can we design architectures and training procedures that produce representations amenable to mechanistic analysis from the outset? What inductive biases most reliably lead to interpretable internal structure, and how do we verify that interpretability is preserved as systems scale?
Tensor network architectures impose structured factorizations on weight matrices, making internal representations geometrically tractable. Weight-sparse models constrain the effective degrees of freedom, reducing the combinatorial complexity of circuit search. Both approaches attempt to build interpretable structure in from the outset rather than discovering it post hoc.
Gradient routing techniques selectively direct learning signals to specific subnetworks, encouraging functional specialization. This can induce modular structure not present under standard training, making circuits more identifiable. MELBO and related objective engineering approaches explicitly reward disentanglement, modularity, or sparsity of internal representations — building interpretability into the learning signal itself.
Can we build unsupervised tools for reconstructing learned algorithms and representations from the weights and activations of deployed models — without requiring labeled ground truth? What theoretical guarantees can we provide about the completeness and faithfulness of such reconstructions?
Sparse autoencoders (SAEs) decompose residual stream activations into sparse linear combinations of learned feature directions. Despite notable successes, their reliance on sparsity as a proxy for feature identity competes with compositionality and hierarchical structure. SAEs rest on an incomplete data model: one designed to account for sparsity alone, so optimizing it competes with the other properties we wish to capture.
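To make the sparsity assumption explicit, here is a minimal forward pass and loss for one common SAE formulation: a ReLU encoder, a linear decoder, and an L1 penalty. The dimensions, initialization, penalty coefficient, and function names are illustrative assumptions, and practical SAEs vary in these details.

```python
import numpy as np

def sae_forward(x, w_enc, b_enc, w_dec, b_dec):
    """Encode activations into non-negative features, then reconstruct them linearly."""
    features = np.maximum(0.0, (x - b_dec) @ w_enc + b_enc)
    reconstruction = features @ w_dec + b_dec
    return features, reconstruction

def sae_loss(x, features, reconstruction, l1_coeff):
    """Reconstruction error plus an L1 penalty that pushes features toward sparsity."""
    mse = np.mean(np.sum((x - reconstruction) ** 2, axis=1))
    sparsity = np.mean(np.sum(np.abs(features), axis=1))
    return mse + l1_coeff * sparsity

rng = np.random.default_rng(0)
d_model, d_dict, batch = 16, 64, 32
x = rng.standard_normal((batch, d_model))       # stand-in for residual stream activations
w_enc = 0.1 * rng.standard_normal((d_model, d_dict))
w_dec = 0.1 * rng.standard_normal((d_dict, d_model))
b_enc = np.zeros(d_dict)
b_dec = np.zeros(d_model)

features, reconstruction = sae_forward(x, w_enc, b_enc, w_dec, b_dec)
loss = sae_loss(x, features, reconstruction, l1_coeff=1e-3)
print(features.shape, float(loss))
```

The only structural prior in this objective is the L1 term, so any hierarchical or compositional relationship between features has to be recovered after the fact rather than being encouraged by training.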
Physics-informed alternatives may address this by decomposing the learned weight distribution into its saddle-point component (circuits) and Gaussian fluctuations (noise), rather than decomposing activations naively into sparse dictionaries.
Rather than decomposing activations into fixed dictionaries, manifold-based approaches characterize the geometry of activation spaces directly — identifying low-dimensional structure, curvature, and topological features that reflect the network's learned representations. This approach avoids the sparsity assumption entirely and may be more robust to the kinds of distributed, compositional representations that SAEs struggle with.
Direct analysis of weight matrices — via SVD, tensor decompositions, or graph-theoretic methods — aims to identify functional subnetworks without requiring activation data at all. This is particularly relevant for understanding circuits that are sparse in weight space rather than activation space, and may be more robust to distribution shift in the deployment setting.
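As a small illustration of the SVD route, the sketch below plants a rank-2 structure in a noisy weight matrix and recovers it from the singular value spectrum alone, with no activation data. The matrix sizes, signal strengths, noise level, and threshold are hypothetical choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in = 64, 64

# Planted low-rank "circuit": two orthonormal input/output direction pairs.
u, _ = np.linalg.qr(rng.standard_normal((d_out, 2)))
v, _ = np.linalg.qr(rng.standard_normal((d_in, 2)))
signal = u @ np.diag([5.0, 3.0]) @ v.T

# Weights are the planted structure plus small unstructured noise.
weights = signal + 0.01 * rng.standard_normal((d_out, d_in))

singular_values = np.linalg.svd(weights, compute_uv=False)
effective_rank = int(np.sum(singular_values > 1.0))  # threshold chosen by hand
print(np.round(singular_values[:4], 3), effective_rank)
```

In real models the gap between signal and bulk is rarely this clean, so choosing the threshold in a principled way is part of the open problem.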
Long-form pieces from PIAMI researchers on the science, history, and practice of physics-inspired interpretability.
We facilitate interdisciplinary research by running workshops at the intersection of statistical physics and AI safety.
A workshop bringing together statistical physicists, deep learning theorists, and AI safety researchers to collaboratively develop the PIAMI research roadmap.
Future workshops and events will be listed here. Sign up for our newsletter to be notified.
PrincInt is an AI Safety field-building organization focused on supporting interdisciplinary collaborations aimed at providing high-assurance safety guarantees for AGI. We facilitate research collaborations, run an internal research division, incubate and fiscally sponsor new academic labs, act as a regrantor, and run a research fellowship program.
Our internal research division, PIRAMID, works directly on the PIAMI agenda.
PIAMI is an open collaboration. Whether you're a researcher, faculty member, or institution, there's a path for you to contribute.
Help us define the research roadmap. Your perspective shapes the direction of PIAMI.
If you're on the job market, join PrincInt or one of our affiliated organizations — we'll be hiring.
We can help procure seed funding for graduate students at academic institutions working on PIAMI topics.
Considering leaving academia? We can help procure funding and fiscally sponsor an independent research org.
Sign up to mentor researchers in our summer and winter fellowship programs.
Work directly on open items in the research roadmap. Reach out and we'll find the right fit.
We're building the principled science that ambitious mechanistic interpretability has always needed. Come help write the next chapter.