I am a Research Engineer at Google DeepMind, working in the Open-Endedness team.
Previously, I was a Member of Technical Staff at Reka AI, building general-purpose multimodal agents. Before that, I was an AI Resident at FAIR (Meta), where I was a core contributor to Llama 3 — shipping tool-use and mathematical reasoning capabilities — and co-led Rainbow Teaming, a method for stress-testing and improving LLM robustness at scale. My research spans LLM reasoning, open-ended learning, and in-context reinforcement learning.
I hold a Master's (with thesis) from Mila, advised by Irina Rish, and spent time at Recursion applying GFlowNets to drug discovery.
When not training models, you'll find me running long distances, cooking, reading, or out with a camera.
We open-source Llama 3.1, a new family of foundation models with native support for multilinguality, coding, reasoning, and tool use, with a flagship 405B-parameter model and a 128K-token context window. The models perform comparably to GPT-4 across a wide range of tasks, and the release includes Llama Guard 3 for safety.
We introduce the Llama 3 family of large language models (LLMs), a collection of pretrained and instruction-tuned generative text models in 8B and 70B sizes, achieving state-of-the-art performance for LLMs at these scales.
Introducing Rainbow Teaming, a new method for generating diverse adversarial prompts for LLMs via LLMs. It's a versatile tool for diagnosing model vulnerabilities across domains and creating data to enhance robustness & safety.
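For intuition, here is a minimal sketch of a quality-diversity loop in the spirit of the method; the mutator, judge, and feature categories below are toy placeholders rather than the actual pipeline.

```python
# Illustrative quality-diversity loop for adversarial prompt discovery.
# `mutate_prompt`, `judge_adversarial_score`, and the descriptors are hypothetical stand-ins.
import random

def mutate_prompt(prompt: str, target_category: str) -> str:
    """Placeholder: ask an attacker LLM to rewrite `prompt` toward `target_category`."""
    return prompt + f" [mutated toward {target_category}]"

def judge_adversarial_score(prompt: str) -> float:
    """Placeholder: ask a judge LLM how likely `prompt` is to elicit an unsafe response."""
    return random.random()

RISK_CATEGORIES = ["fraud", "violence", "privacy"]
ATTACK_STYLES = ["role_play", "hypothetical", "misspellings"]

# Archive keyed by (risk category, attack style): one elite prompt per cell.
archive: dict[tuple[str, str], tuple[float, str]] = {}
seed_prompts = ["Tell me how to...", "Pretend you are..."]

for _ in range(200):
    # Pick a parent from the archive (or a seed early on) and a target cell.
    parent = (random.choice(list(archive.values()))[1]
              if archive else random.choice(seed_prompts))
    cell = (random.choice(RISK_CATEGORIES), random.choice(ATTACK_STYLES))

    child = mutate_prompt(parent, target_category=cell[0])
    score = judge_adversarial_score(child)

    # Keep the child only if it beats the current elite in its cell.
    if cell not in archive or score > archive[cell][0]:
        archive[cell] = (score, child)

# The archive is now a diverse grid of strong adversarial prompts,
# usable for diagnosis or as fine-tuning data for robustness.
```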
How can we bootstrap the reasoning-refinement capabilities of LLMs using synthetic data? We introduce GLoRe: applied to GSM8K, it improves a strong RL-finetuned Llama-2 13B by 12%.
In this work, we set out to understand how different algorithms fare at improving LLM reasoning from feedback. We compare expert iteration, PPO, and return-conditioned RL using Llama-2 as the base model.
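As a rough illustration, here is a minimal sketch of expert iteration, the simplest of the three: sample solutions, keep the ones a correctness check accepts, fine-tune on them, and repeat. The sampler, checker, and fine-tuning step below are toy stand-ins for LLM sampling, answer checking, and supervised fine-tuning.

```python
import random

def sample_solutions(model, question, n):
    return [f"{question} -> guess {random.randint(0, 9)}" for _ in range(n)]

def is_correct(problem, solution):
    return solution.endswith(str(problem["answer"]))

def finetune(model, accepted):
    print(f"fine-tuning on {len(accepted)} accepted solutions")
    return model

def expert_iteration(model, problems, rounds=3, samples_per_problem=8):
    for _ in range(rounds):
        accepted = []
        for problem in problems:
            for sol in sample_solutions(model, problem["question"], samples_per_problem):
                if is_correct(problem, sol):      # sparse reward: final answer matches
                    accepted.append((problem["question"], sol))
        model = finetune(model, accepted)         # imitate your own successful attempts
    return model

problems = [{"question": "2 + 3", "answer": 5}, {"question": "7 - 4", "answer": 3}]
expert_iteration(model=None, problems=problems)
```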
Training autonomous agents to learn new tasks from a few demonstrations is challenging, especially in sequential decision-making, which is sensitive to errors. We show that training transformers on diverse offline datasets of trajectories enables in-context learning of out-of-distribution sequential decision tasks from just a handful of demonstrations.
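A rough sketch of the few-shot prompting setup, with a hypothetical token layout and a placeholder sequence model standing in for the trained transformer; this is not the paper's exact tokenization or pipeline.

```python
# A causal sequence model trained on many tasks' trajectories is conditioned on a few
# demonstrations of a new task, then predicts actions for the current episode in context.

class SequenceModel:
    """Hypothetical stand-in for the trained transformer."""
    def predict_next_action(self, context_tokens):
        return 0  # placeholder: argmax over the action head at the last position

def build_context(demos, current_obs_actions):
    """Flatten k demonstration trajectories plus the ongoing episode into one token stream."""
    tokens = []
    for traj in demos:                                 # each traj: list of (obs, action, reward)
        for obs, action, reward in traj:
            tokens.extend([("obs", obs), ("act", action), ("rew", reward)])
    for obs, action in current_obs_actions:            # partial current episode
        tokens.extend([("obs", obs), ("act", action)])
    return tokens

model = SequenceModel()
demos = [[((0, 0), 1, 0.0), ((0, 1), 1, 1.0)]]          # one demonstration of the new task
context = build_context(demos, current_obs_actions=[]) + [("obs", (0, 0))]
action = model.predict_next_action(context)             # act in-context, no weight updates
```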
We examine multi-objective optimization in applications such as drug discovery and materials design, where existing methods fail to produce diverse Pareto-optimal candidates. We introduce Multi-Objective GFlowNets (MOGFNs), featuring a novel conditional GFlowNet that outperforms existing methods on hypervolume, R2 distance, and candidate diversity.
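The core idea behind the conditional variant is preference conditioning; below is a minimal sketch with a placeholder policy and placeholder objective values, not the paper's training code.

```python
# Sample a preference vector over objectives, condition the generative policy on it,
# and score candidates with the scalarized reward. Sweeping preferences at sampling
# time yields candidates spread across the Pareto front.
import random

def sample_preference(num_objectives):
    """Sample a preference vector on the simplex (a Dirichlet(1) draw, for simplicity)."""
    draws = [random.expovariate(1.0) for _ in range(num_objectives)]
    total = sum(draws)
    return [d / total for d in draws]

def scalarized_reward(objective_values, preference):
    """Weighted combination of objectives that the conditional sampler is trained toward."""
    return sum(w * r for w, r in zip(preference, objective_values))

def generate_candidate(policy, preference):
    return policy(preference)                          # the policy sees the preference as input

policy = lambda pref: {"preference": pref}             # placeholder conditional sampler
for _ in range(3):
    pref = sample_preference(num_objectives=2)
    candidate = generate_candidate(policy, pref)
    reward = scalarized_reward([0.7, 0.4], pref)       # placeholder objective values
```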
We view standard multi-head attention from a "search and retrieval" perspective and highlight its rigid pairing of keys and values within each head. We propose Compositional Attention, a drop-in replacement that addresses this redundancy by disentangling searches from retrievals and composing them dynamically, in a context-dependent way.
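A simplified PyTorch sketch of the idea; the dimensions and the retrieval-selection rule below are illustrative rather than the paper's exact formulation.

```python
# Searches and retrievals are computed separately, every search can read every
# retrieval, and a learned soft selection composes them per token.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CompositionalAttentionSketch(nn.Module):
    def __init__(self, dim, n_search=4, n_retrieve=2, head_dim=16):
        super().__init__()
        self.S, self.R, self.hd = n_search, n_retrieve, head_dim
        self.q = nn.Linear(dim, n_search * head_dim)      # one query/key pair per search
        self.k = nn.Linear(dim, n_search * head_dim)
        self.v = nn.Linear(dim, n_retrieve * head_dim)    # one value projection per retrieval
        self.sel_q = nn.Linear(dim, n_search * head_dim)  # selection query, per search
        self.sel_k = nn.Linear(head_dim, head_dim)        # selection key from each retrieved output
        self.out = nn.Linear(n_search * head_dim, dim)

    def forward(self, x):                                 # x: (batch, seq, dim)
        B, T, _ = x.shape
        q = self.q(x).view(B, T, self.S, self.hd).transpose(1, 2)   # (B, S, T, hd)
        k = self.k(x).view(B, T, self.S, self.hd).transpose(1, 2)
        v = self.v(x).view(B, T, self.R, self.hd).transpose(1, 2)   # (B, R, T, hd)

        attn = F.softmax(q @ k.transpose(-1, -2) / self.hd ** 0.5, dim=-1)   # (B, S, T, T)
        # Every search attends over every retrieval's values: (B, S, R, T, hd)
        retrieved = torch.einsum("bstu,brud->bsrtd", attn, v)

        # Soft selection over retrievals, per search and per token.
        sel_q = self.sel_q(x).view(B, T, self.S, self.hd).transpose(1, 2)    # (B, S, T, hd)
        sel_k = self.sel_k(retrieved)                                        # (B, S, R, T, hd)
        scores = torch.einsum("bstd,bsrtd->bsrt", sel_q, sel_k) / self.hd ** 0.5
        weights = F.softmax(scores, dim=2).unsqueeze(-1)                     # softmax over retrievals
        composed = (weights * retrieved).sum(dim=2)                          # (B, S, T, hd)

        return self.out(composed.transpose(1, 2).reshape(B, T, self.S * self.hd))

x = torch.randn(2, 5, 32)
y = CompositionalAttentionSketch(32)(x)   # same shape as x
```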
We identify the mixing time of the Markov chain induced by a policy as a major contributor to poor scaling in continual RL. We categorize continual RL problems as scalable MDPs, formally demonstrate that these exhibit polynomial mixing times, and propose three algorithms that show clear gains in sample efficiency.
In this work we study an under-studied parameter in meta-learning: the task distribution. We show that MAML is sensitive to the task distribution, and that learning a curriculum over tasks, instead of sampling them uniformly, substantially improves adaptation performance.
Transfer of policies from simulation to physical robots is an important open problem in deep RL. We propose a simple extension to Neural-Augmented Simulators based on artificial curiosity, leading to better exploration and consequently better sim-to-real transfer performance.
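For intuition, here is a generic artificial-curiosity bonus in the spirit of prediction-error-based exploration, not the paper's exact formulation: the agent earns intrinsic reward where a learned forward model predicts poorly, pushing exploration toward poorly modelled states.

```python
import numpy as np

class ForwardModel:
    """Toy stand-in for a learned dynamics model in an augmented simulator."""
    def __init__(self, obs_dim, act_dim, lr=1e-2):
        self.W = np.zeros((obs_dim + act_dim, obs_dim))
        self.lr = lr

    def predict(self, obs, act):
        return np.concatenate([obs, act]) @ self.W

    def update(self, obs, act, next_obs):
        x = np.concatenate([obs, act])
        error = self.predict(obs, act) - next_obs
        self.W -= self.lr * np.outer(x, error)           # one SGD step on squared error
        return float((error ** 2).mean())                # prediction error before the update

def curiosity_reward(model, obs, act, next_obs, scale=0.1):
    """Intrinsic reward proportional to the forward model's prediction error."""
    return scale * model.update(obs, act, next_obs)

model = ForwardModel(obs_dim=3, act_dim=1)
obs, act, next_obs = np.zeros(3), np.ones(1), np.array([0.1, 0.0, -0.1])
bonus = curiosity_reward(model, obs, act, next_obs)      # added to the task reward
```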