Interpretability at scale
An independent laboratory studying interpretability, alignment, and the theoretical basis of large-scale neural network behavior. We publish openly, share code and weights, and turn down corporate funding that requires us to withhold either.
Reverson Labs runs four long-horizon research programs. We define a program as a question on which we expect a small team to need at least three years to make meaningful progress. We rarely change them.
Mechanistic methods for understanding the computations performed by frontier transformer models. Circuit discovery, feature attribution, and the systematic mapping of model behavior to model internals.
Formal frameworks for specifying, monitoring, and steering the behavior of trained systems. Particular focus on settings where the human supervisory signal is incomplete, ambiguous, or systematically biased.
The mathematical structure of large-scale training: feature emergence, grokking phenomena, scaling laws, and the implicit biases of stochastic gradient descent on overparametrized networks.
Honest, hard-to-game benchmarks for high-stakes capabilities. Held-out test design, contamination detection, and the construction of evaluations that scale with model capability.
Selected recent publications, in reverse chronological order. Every paper here has been posted to arXiv, and we release code and trained weights for every empirical result. Our complete publication list lives in the library.
Reverson Labs is permanently small. We hire only when a research program has a specific gap we can't fill internally, and we hire people who could otherwise get tenure-track offers — but who prefer to write papers, share code, and not teach undergraduate calculus.
PhD Stanford '14, postdoc at OpenAI & DeepMind. Founder of the lab. Research focus: interpretability and alignment theory.
PhD MIT '17, alignment theorist. Co-author of seven NeurIPS spotlight papers since joining. Leads the alignment program.
PhD Tokyo '18, previously at Anthropic interpretability. Leads the interpretability program. First author on Sparse Circuit Discovery (Nov 2025).
PhD Carnegie Mellon '19, learning theorist. Leads work on training dynamics and the mathematical structure of generalization.
All positions are full-time, based in Berkeley, and come with the standard package: competitive salary, full benefits, full compute allocation, full credit on published work. We sponsor visas. We hire on quality, not pedigree.
The Reverson group's circuit-discovery framework has become the de-facto vocabulary for the field — a small lab punching far above its weight on mechanistic interpretability.
Reverson Labs continues to publish some of the most theoretically rigorous alignment work in the field — and to release all of it openly, which is rarer than it should be.
We publish everything openly. We're glad to talk with researchers, journalists, and policy people who want to dig into specific questions — or who think we might be wrong.