Reverson Labs — Research on the foundations of machine learning

Research on the foundations of machine learning.

An independent laboratory studying interpretability, alignment, and the theoretical basis of large-scale neural network behavior. We publish openly, share code and weights, and turn down corporate funding that requires us to withhold either.

Four questions we think need answering. By someone. Possibly us.

Reverson Labs runs four long-horizon research programs. We define a program as a question we expect to take at least three years and a small team to make meaningful progress on. We change them rarely.

§ 01 · 14 papers

Interpretability at scale

Mechanistic methods for understanding the computations performed by frontier transformer models. Circuit discovery, feature attribution, and the systematic mapping of model behavior to model internals.

14 papers · 6 researchers

§ 02 · 9 papers

Alignment theory

Formal frameworks for specifying, monitoring, and steering the behavior of trained systems. Particular focus on settings where the human supervisory signal is incomplete, ambiguous, or systematically biased.

9 papers · 5 researchers

§ 03 · 11 papers

Learning dynamics

The mathematical structure of large-scale training: feature emergence, grokking phenomena, scaling laws, and the implicit biases of stochastic gradient descent on overparametrized networks.

11 papers · 4 researchers

§ 04 · 7 papers

Evaluation methods

Honest, hard-to-game benchmarks for high-stakes capabilities. Held-out test design, contamination detection, and the construction of evaluations that scale with model capability.

7 papers · 3 researchers

A few papers we're proud of. With a few we're still mildly anxious about.

Selected recent publications, in reverse chronological order. Every paper here has been posted to arXiv, and we release code and trained weights for every empirical result. Our complete publication list lives in the library.

001

NEW Sparse Circuit Discovery at Scale: Tracing 12-Layer Decision Pathways in Transformer Models

K. Aoki, M. Quincy, R. Ohlsen, P. Reverson

Preprint · arXiv:2511.04827

pdf · code · bib

002

A Formal Framework for Outer Alignment Under Bounded Supervision

M. Quincy, T. Ife, P. Reverson

NeurIPS 2025 · Oral · Spotlight

pdf · code · bib

003

Phase Transitions in Skill Acquisition: A Bayesian Account of Grokking

R. Ohlsen, S. Nadar, K. Aoki

ICLR 2025 · Oral

pdf · code · bib

004

Contamination Audits for Long-Tail Evaluation: How Much Did the Model Already See?

T. Ife, M. Quincy, R. Ohlsen

ICML 2025 · Best paper · runner-up

pdf · code · bib

005

Feature Universality in Frontier Transformers: Evidence from Cross-Model Activation Patching

K. Aoki, P. Reverson, plus contributors at Google DeepMind

NeurIPS 2024 · Oral

pdf · code · bib

006

A Note on the Geometry of Gradient Flow Near Loss-Landscape Saddles

S. Nadar, R. Ohlsen

JMLR 2024 · Vol. 25 · 47 pp

pdf · code · bib

007

Eliciting Latent Knowledge via Self-Consistent Probing of Hidden States

M. Quincy, P. Reverson

ICLR 2024 · Spotlight

pdf · code · bib

Fourteen people, mostly in one building, mostly working on hard problems.

Reverson Labs is permanently small. We hire only when a research program has a specific gap we can't fill internally, and we hire people who could otherwise get tenure-track offers — but who prefer to write papers, share code, and not teach undergraduate calculus.

— Director · Joined 2019

Priya Reverson

PhD Stanford '14, postdoc at OpenAI & DeepMind. Founder of the lab. Research focus: interpretability and alignment theory.

profile · papers →

— Senior · Joined 2020

Marlowe Quincy

PhD MIT '17, alignment theorist. Co-author of seven NeurIPS spotlight papers since joining. Leads the alignment program.

profile · papers →

— Senior · Joined 2022

Kenji Aoki

PhD Tokyo '18, previously at Anthropic interpretability. Leads the interpretability program. First author on Sparse Circuit Discovery (Nov 2025).

profile · papers →

— Senior · Joined 2021

Rivka Ohlsen

PhD Carnegie Mellon '19, learning theorist. Leads work on training dynamics and the mathematical structure of generalization.

profile · papers →

We're hiring for four roles. Applications are open, reviewed continuously.

All positions are full-time, based in Berkeley, and come with the standard package: competitive salary, full benefits, full compute allocation, full credit on published work. We sponsor visas. We hire on quality, not pedigree.

R · 01

Research Scientist · Interpretability

Program 01 · Mechanistic interpretability

PhD + 0–5 yrs

R · 02

Research Scientist · Alignment theory

Program 02 · Formal alignment

PhD + 2–10 yrs

E · 03

Research Engineer · Infrastructure

Cross-program · Training & eval tools

5+ yrs ML systems

V · 04

Visiting Scholar · Evaluation methods

Program 04 · 6–12 month residency

Faculty / postdoc

When other people cite us. Or write about us.

The Reverson group's circuit-discovery framework has become the de-facto vocabulary for the field — a small lab punching far above its weight on mechanistic interpretability.

Nature Machine Intelligence Editorial, May 2025

cit. 287

Reverson Labs continues to publish some of the most theoretically rigorous alignment work in the field — and to release all of it openly, which is rarer than it should be.

The Economist AI special report, Sept 2024

cit. 142

28,914

Total citations
across all published work

h-index across the lab
as of November 2025

Peer-reviewed papers
since founding, 2019

100%

Empirical results
with open code & weights


Read our papers. Read the code. Get in touch.
