Hi, I'm Trenton.

I’m a Member of Technical Staff on the Mechanistic Interpretability team at Anthropic. A nice overview of our mission can be found here. I’m currently working on using dictionary learning to disentangle superposition in artificial neural networks.

Information about me:

I have a PhD in Systems Biology from Harvard. My thesis was on “Sparse Representations in Biological and Artificial Neural Networks” in the Kreiman Lab with support from the NSF Graduate Research Fellowship. I also spent time at the Berkeley Redwood Center for Theoretical Neuroscience as a visiting researcher.
I graduated from Duke University in May 2020 with a self-made major in “Minds and Machines: Biological and Artificial Intelligence”. I was lucky to attend as a Robertson Scholar, which provided full funding during all four years, including summer experiences.
At Duke, I spent a year doing research in Dr. Michael Lynch’s Lab attempting to use machine learning to design new CRISPR guide RNAs for safer, more effective genome editing. Afterwards, I was affiliated with Dr. Debora Marks’s Lab at Harvard Medical School applying deep learning to protein design.

I am involved in the movement/philosophy/set of ideas that is Effective Altruism. I am also a fan of prediction markets and make public forecasts on Metaculus here. If the world was void of both interesting research questions and global catastrophic risks(!), you’d find me backpacking around the world with my film camera. I still try to do this when I have time off and get the chance to travel somewhere cool.

Have any feedback for me? Please consider filling out this anonymous feedback form so I can learn and grow.

Publications (in reverse chronological order):

On the Biology of a Large Language Model
Jack Lindsey†, Wes Gurnee*, Emmanuel Ameisen*, Brian Chen*, Adam Pearce*, Nicholas L. Turner*, Craig Citro*, David Abrahams, Shan Carter, Basil Hosmer, Jonathan Marcus, Michael Sklar, Adly Templeton, Trenton Bricken, Callum McDougall, Hoagy Cunningham, Thomas Henighan, Adam Jermyn, Andy Jones, Andrew Persic, Zhenyi Qi, T. Ben Thompson, Sam Zimmerman, Kelley Rivoire, Thomas Conerly, Chris Olah, Joshua Batson*
\† Lead Contributor
*(Core Contributor)
Anthropic, March 2025
[paper] [blog-post] [tweet-thread]

Circuit Tracing: Revealing Computational Graphs in Language Models
Emmanuel Ameisen*, Jack Lindsey*, Adam Pearce*, Wes Gurnee*, Nicholas L. Turner*, Brian Chen*, Craig Citro*, David Abrahams, Shan Carter, Basil Hosmer, Jonathan Marcus, Michael Sklar, Adly Templeton, Trenton Bricken, Callum McDougall, Hoagy Cunningham, Thomas Henighan, Adam Jermyn, Andy Jones, Andrew Persic, Zhenyi Qi, T. Ben Thompson, Sam Zimmerman, Kelley Rivoire, Thomas Conerly, Chris Olah, Joshua Batson*
*(Core Contributor)
Anthropic, March 2025
[paper] [blog-post] [tweet-thread]

Auditing Language Models for Hidden Objectives
Samuel Marks, Johannes Treutlein, Trenton Bricken, Jack Lindsey, Jonathan Marcus, Siddharth Mishra-Sharma, Daniel Ziegler, Emmanuel Ameisen, Joshua Batson, Tim Belonax, Samuel R. Bowman, Shan Carter, Brian Chen, Hoagy Cunningham, Carson Denison, Florian Dietz, Satvik Golechha, Akbir Khan, Jan Kirchner, Jan Leike, Austin Meek, Kei Nishimura-Gasparian, Euan Ong, Christopher Olah, Adam Pearce, Fabien Roger, Jeanne Salle, Andy Shih, Meg Tong, Drake Thomas, Kelley Rivoire, Adam Jermyn, Monte MacDiarmid, Tom Henighan, Evan Hubinger
Anthropic, March 2025
[paper] [tweet-thread]

Insights on Crosscoder Model Diffing
Siddharth Mishra-Sharma, Trenton Bricken, Jack Lindsey, Adam Jermyn, Jonathan Marcus, Kelley Rivoire, Christopher Olah, Thomas Henighan
Anthropic, February 2025
[paper]

Stage-Wise Model Diffing
Trenton Bricken, Siddharth Mishra-Sharma, Jonathan Marcus, Adam Jermyn, Christopher Olah, Kelley Rivoire, Thomas Henighan
Anthropic, December 2024
[paper] [tweet-thread]

Using Dictionary Learning Features as Classifiers
Trenton Bricken, Jonathan Marcus, Siddharth Mishra-Sharma, Meg Tong, Ethan Perez, Mrinank Sharma, Kelley Rivoire, Thomas Henighan; edited by Adam Jermyn
Anthropic, October 2024
[paper] [tweet-thread]

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
Adly Templeton*, Tom Conerly*, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian Chen, Adam Pearce, Craig Citro, Emmanuel Ameisen, Andy Jones, Hoagy Cunningham, Nicholas L Turner, Callum McDougall, Monte MacDiarmid, Alex Tamkin, Esin Durmus, Tristan Hume, Francesco Mosconi, C. Daniel Freeman, Theodore R. Sumers, Edward Rees, Joshua Batson, Adam Jermyn, Shan Carter, Chris Olah, Tom Henighan
*(Core Contributor)
Anthropic, May 2024
[paper] [blog-post] [tweet-thread]

Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
Trenton Bricken*, Adly Templeton*, Joshua Batson*, Brian Chen*, Adam Jermyn*, Tom Conerly, Nicholas L Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Alex Tamkin, Karina Nguyen, Brayden McLean, Josiah E Burke, Tristan Hume, Shan Carter, Tom Henighan, Chris Olah
*(Core Contributor)
Anthropic, October 2023
[paper] [blog-post] [tweet-thread]

Emergence of Sparse Representations from Noise
Trenton Bricken*, Rylan Schaeffer, Bruno Olshausen, Gabriel Kreiman
*(First author)
ICML, May 2023
[paper]

Sparse Distributed Memory is a Continual Learner
Trenton Bricken*, Xander Davies, Deepak Singh, Dmitry Krotov, Gabriel Kreiman
*(First author)
ICLR, September 2022
[paper] [tweet-thread]

Attention Approximates Sparse Distributed Memory
Trenton Bricken*, Cengiz Pehlevan
*(First author)
NeurIPS, December 2021
[paper] [blog-post] [tweet-thread]

MIT Center for Brains Minds+ Machines Talk:

I gave a longer talk that enabled me to cover more of SDM’s biological plausibilty to the VSA Online community here.

High-content screening of coronavirus genes for innate immune suppression reveals enhanced potency of SARS-CoV-2 proteins.
Erika J Olson*, David M Brown*, Timothy Z Chang, Lin Ding, Tai L Ng, H. Sloane Weiss, Peter Koch, Yukiye Koide, Nathan Rollins, Pia Mach, Tobias Meisinger, Trenton Bricken, Joshus Rollins, Yun Zhang, Colin Molloy, Yun Zhang, Briodget N Queenan, Timothy Mitchison, Debora Marks, Jeffrey C Way, John I Glass, Pamela A Silver
*(First authors)
bioRxiv, March 2021
[preprint] [tweet-thread]

Computationally Optimized SARS-CoV-2 MHC Class I and II Vaccine Formulations Predicted to Target Human Haplotype Distributions.
Ge Liu*, Brandon Carter*, Trenton Bricken, Siddhartha Jain, Mathias Viard, Mary Carrington, David K Gifford
*(First authors)
Cell Systems, July 2020
[paper] [code] [preprint] [tweet-thread]

My Google Scholar profile can be found here.

Talks:

My PhD Defense - March 2025
Stanford CS 25: Transformers United V2 - March 2023, [public-link]
Vector Symbolic Architectures Webinar Series - September 2022, [public-link]
Redwood Center for Theoretical Neuroscience - May 2022
MIT Center for Brains Minds+ Machines - November 2021, [public-link]

Podcasts:

Dwarkesh Podcast 2025 Episode:

Dwarkesh Podcast 2024 Episode:

Past Projects (in reverse chronological order):

Upside Down Free Energy - Fall 2020 - Motivated by progress in “Upside Down” supervised reinforcement learning, I tried to connect it Friston’s Free Energy Principle (FEP) and develop more hierarchical versions of FEP. This required first implementing benchmarks of the existing “Upside Down” RL algorithms (see next entry). I was starting to get somewhat interesting results but RL is really hard and I started down the rabbit hole of Sparse Distributed Memory. See the GitHub repository for a draft PDF write up. Thanks to Beren Millidge and Alec Tschantz for their supervision and discussions about this project.
RewardConditionedUDRL - Fall 2020 - Open source codebase combining implementations of Reward Conditioned Policies and Training Agents using Upside-Down Reinforcement Learning. The former had no public implementation and the latter had a few implemented as Jupyter Notebooks but that had a number of issues I flagged eg. here and here. I hope this open source codebase will serve to both fully replicate the aforementioned papers and be used as a starting point for further research in the exciting domain of supervised RL.
SARS-CoV-2 mutation effects and 3D structure prediction from sequence covariation. - Summer 2020 - Collaborated with the Marks lab to help produce their SARS-CoV-2 mutation effect and 3D structure predictions using EVCouplings.
Website: https://marks.hms.harvard.edu/sars-cov-2
RL Learning Byzantine Fault Tolerant (BFT) Consensus Protocols - Senior Year - Supervised by Dr. Kartik Nayak, final class project turned research project. Investigated the ability for deep reinforcement learning agents to discover and prove BFT consensus protocols. This was a great way to learn more about reinforcement learning but the tasks were too difficult for the agents to learn given the algorithms we were attempting to use. A write up of the project and uncleaned version of the codebase is available here.
Protein Generation and Optimization - Supervised by Dr. Debora Marks’s Lab as my Senior Thesis - This research was motivated by the promise of recent developments in our ability to predict protein functionality and the problem of finding novel sequences that maximize this prediction. We tried developing a new solution using invertible neural networks and variational inference to approximate the intractable distribution of any protein function predictor with reason to believe it would outperform Markov Chain Monte Carlo methods. My senior thesis write up of the work and where it seemed to succeed and fail can be found with the codebase here.
PyTorch Discrete Normalizing Flows - Winter Break 2019 - Learning about Discrete Normalizing Flows from “Discrete Flows: Invertible Generative Models of Discrete Data”, by Dustin Tran et al. https://arxiv.org/pdf/1905.10347.pdf, I tried implementing them using the coded provided in edward2 but found that none of it worked. I ended up porting all of the code into PyTorch, which required making a number of modifications and getting it working on a toy example. This repo as of October 2022 has 95 Github stars and two developers have reached out to collaborate and help me replicate the results.
Tail Free Sampling - Independent project, advice from friends and mentors - Developed a new method to sample sequences from autoregressive neural networks for open-ended sequence generation.
Primary and Tertiary Protein AutoEncoder - Final Class Project - Investigated if a deep AutoEncoder could learn the relationship between protein sequence and tertiary structure in order to then do either sequence or structure optimization in the latent space. It didn’t work very well but I learned a lot!
Facebook Chatbot for Spaced Repetition Learning - HackDuke 2016 - Spaced Repetition is both wonderful and highly neglected. Can we make it more popular and easy to do routinely using a Facebook Chatbot to both harass and motivate us? Got everything working! But there were always more bugs and this didn’t solve the fundamental problem of spaced repetition learning still taking a huge amount of motivation. You could argue that presenting the cards over Messenger just created more distractions.

Other Locations on the Interwebs

I am pretty active on Twitter. My DMs are open and you should feel free to reach out but I can’t promise I’ll be good at replying! I sometimes upload my film photography to Instagram and to my portfolio.