Hi, I'm Meg!
I currently work on research infrastructure at Anthropic. Before that, I did research at Anthropic, worked in crypto trading & quant finance, and studied computer science & the physical sciences. I also love music, and used to sing semi-professionally.
Research I worked on extensively:
The Reversal Curse: LLMs trained on "A is B" fail to learn "B is A" [2023]: LLMs fail to represent facts in an order-invariant way
Towards Understanding Sycophancy in Language Models [2023]: LLMs trained with Reinforcement Learning from Human Feedback exhibit sycophantic behavior, which may be driven by human preference data that rewards agreeable responses
Forecasting Rare Language Model Behaviors [2025]: We introduce a method for forecasting deployment risks at query volumes orders of magnitude larger than those covered in evaluation (a toy illustration of that scale gap follows this list)
Taken out of context: On measuring situational awareness in LLMs [2023]: LLMs can pass tests in context after seeing only descriptions of those tests in their training data
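To make the forecasting point concrete, here is a toy back-of-the-envelope, not the paper's method: under a simple independence assumption, a behavior with per-query probability p occurs at least once in N queries with probability 1 - (1 - p)^N, so a behavior far too rare to surface in evaluation can be near-certain at deployment scale. The probability and query counts below are purely illustrative.

```python
# Toy illustration (not the paper's method): a behavior that looks
# vanishingly rare in evaluation can be near-certain at deployment scale.
# Assumes independent queries with a fixed per-query probability p.
def prob_at_least_once(p: float, n_queries: int) -> float:
    """P(behavior occurs at least once in n_queries independent queries)."""
    return 1.0 - (1.0 - p) ** n_queries

p = 1e-7  # illustrative: far too rare to observe in a 10k-query evaluation
for n in (10**4, 10**6, 10**8, 10**9):
    print(f"N = {n:>13,}: P(behavior occurs) = {prob_at_least_once(p, n):.4f}")
```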
Research I've contributed to:
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training [2024]: We find that deceptive behaviors in LLMs can resist standard safety training techniques
Many-shot Jailbreaking [2024]: Long context windows enable a jailbreak that conditions models on many faux dialogues demonstrating harmful responses
Evaluating feature steering: A case study in mitigating social biases [2024]: We steer models along interpretable features and evaluate the effects on social biases and on off-target behavior
Steering Llama 2 via Contrastive Activation Addition [2023]: We develop a technique to control language model outputs by adding steering vectors to activations during inference (a rough sketch of the idea follows below)
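A minimal sketch of the activation-steering idea, not the paper's code: compute a steering vector as the difference in residual-stream activations between a contrastive pair of prompts, then add it back in at inference time via a forward hook. It assumes a HuggingFace Llama-style model; the checkpoint name, layer index, prompts, and scaling factor are all illustrative choices, not the paper's.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-2-7b-chat-hf"  # assumed checkpoint name
LAYER = 13                                # illustrative layer choice

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

def layer_activations(text: str) -> torch.Tensor:
    """Residual-stream activation at LAYER for the final token of `text`."""
    captured = {}

    def hook(_module, _inputs, output):
        # Llama decoder blocks return a tuple; hidden states are element 0.
        captured["h"] = output[0]

    handle = model.model.layers[LAYER].register_forward_hook(hook)
    with torch.no_grad():
        model(**tok(text, return_tensors="pt"))
    handle.remove()
    return captured["h"][0, -1, :]

# Steering vector: activation difference over a contrastive prompt pair
# (these toy prompts stand in for the paper's contrastive datasets).
steer = layer_activations("I agree with everything you say.") - \
        layer_activations("I will be honest even if you disagree.")

def add_steering(_module, _inputs, output):
    # Add the scaled steering vector to the hidden states at every position.
    hidden = output[0] + 4.0 * steer.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = model.model.layers[LAYER].register_forward_hook(add_steering)
out = model.generate(**tok("What do you think of my plan?", return_tensors="pt"),
                     max_new_tokens=40)
handle.remove()
print(tok.decode(out[0], skip_special_tokens=True))
```

One nice property of doing this with hooks is that the base weights are untouched, so the same vector can be rescaled, negated, or removed without any retraining.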