
Research Engineer at Google DeepMind
Building open-ended systems that keep inventing, learning, and surprising us.
Previously at Reka AI building multimodal agents, and an AI Resident at FAIR (Meta): core contributor to Llama 3 and co-lead of Rainbow Teaming. Master's from Mila with Irina Rish; GFlowNets for drug discovery at Recursion. Off the clock: long runs, cooking, books, a camera.

Open 405B-parameter foundation models with a 128K context window, native tool use, and multilingual support — performance comparable to GPT-4, released openly.
Modern artificial intelligence (AI) systems are powered by foundation models. This paper presents a new set of foundation models, called Llama 3. It is a herd of language models that natively support multilinguality, coding, reasoning, and tool usage. Our largest model is a dense Transformer with 405B parameters and a context window of up to 128K tokens. This paper presents an extensive empirical evaluation of Llama 3. We find that Llama 3 delivers comparable quality to leading language models such as GPT-4 on a plethora of tasks. We publicly release Llama 3, including pre-trained and post-trained versions of the 405B parameter language model and our Llama Guard 3 model for input and output safety. The paper also presents the results of experiments in which we integrate image, video, and speech capabilities into Llama 3 via a compositional approach. We observe this approach performs competitively with the state-of-the-art on image, video, and speech recognition tasks. The resulting models are not yet being broadly released as they are still under development.
@misc{team2024the,
title = {The Llama 3 Herd of Models},
author = {Llama Team},
year = {2024},
eprint = {2407.21783},
archivePrefix = {arXiv}
}
8B and 70B pretrained and instruction-tuned models that set state-of-the-art performance at their scales on release.
@misc{teamllama3,
title = {Llama-3 Preview Models},
author = {Llama Team},
year = {2024},
url = {https://ai.meta.com/blog/meta-llama-3/}
}
Quality-diversity search that automatically generates diverse adversarial prompts, exposing LLM vulnerabilities and producing data that measurably improves robustness.
As large language models (LLMs) become increasingly prevalent across many real-world applications, understanding and enhancing their robustness to adversarial attacks is of paramount importance. Existing methods for identifying adversarial prompts tend to focus on specific domains, lack diversity, or require extensive human annotations. To address these limitations, we present Rainbow Teaming, a novel black-box approach for producing a diverse collection of adversarial prompts. Rainbow Teaming casts adversarial prompt generation as a quality-diversity problem and uses open-ended search to generate prompts that are both effective and diverse. Focusing on the safety domain, we use Rainbow Teaming to target various state-of-the-art LLMs, including the Llama 2 and Llama 3 models. Our approach reveals hundreds of effective adversarial prompts, with an attack success rate exceeding 90% across all tested models. Furthermore, we demonstrate that prompts generated by Rainbow Teaming are highly transferable and that fine-tuning models with synthetic data generated by our method significantly enhances their safety without sacrificing general performance or helpfulness. We additionally explore the versatility of Rainbow Teaming by applying it to question answering and cybersecurity, showcasing its potential to drive robust open-ended self-improvement in a wide range of applications.
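The quality-diversity framing in the abstract above can be illustrated with a toy MAP-Elites-style loop. This is a minimal sketch under stated assumptions, not the paper's implementation: `descriptor`, `score`, and `mutate` are hypothetical stand-ins operating on plain strings, whereas Rainbow Teaming relies on LLMs to mutate and judge prompts.

```python
import random

# Hypothetical stand-ins: real systems would use an LLM to mutate prompts and
# a judge model to score them; here both are toy functions over strings.
WORDS = ["alpha", "beta", "gamma", "delta", "omega"]

def descriptor(prompt):
    """Map a prompt to an archive cell (toy features: word count, first letter)."""
    words = prompt.split()
    return (min(len(words), 4), words[0][0])

def score(prompt):
    """Toy fitness: number of distinct words (stands in for attack success)."""
    return len(set(prompt.split()))

def mutate(prompt, rng):
    """Toy mutation: append a word (up to four words) or replace one word."""
    words = prompt.split()
    if rng.random() < 0.5 and len(words) < 4:
        words.append(rng.choice(WORDS))
    else:
        words[rng.randrange(len(words))] = rng.choice(WORDS)
    return " ".join(words)

def quality_diversity_search(seed_prompt, iterations=500, seed=0):
    rng = random.Random(seed)
    archive = {descriptor(seed_prompt): seed_prompt}
    for _ in range(iterations):
        parent = rng.choice(list(archive.values()))
        child = mutate(parent, rng)
        cell = descriptor(child)
        # Keep the child only if its cell is empty or it beats the incumbent.
        if cell not in archive or score(child) > score(archive[cell]):
            archive[cell] = child
    return archive

archive = quality_diversity_search("alpha")
```

The archive keeps at most one elite per cell, so the search is pushed toward prompts that are both high-scoring and spread across the descriptor space.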
@inproceedings{samvelyan2024rainbow,
title = {Rainbow Teaming: Open-Ended Generation of Diverse Adversarial Prompts},
author = {Mikayel Samvelyan and Sharath Chandra Raparthy and Andrei Lupu and Eric Hambro and Aram H. Markosyan and Manish Bhatt and Yuning Mao and Minqi Jiang and Jack Parker-Holder and Jakob Foerster and Tim Rocktäschel and Roberta Raileanu},
booktitle = {Neural Information Processing Systems (NeurIPS), 2024},
year = {2024},
eprint = {2402.16822},
archivePrefix = {arXiv}
}
Stepwise reward models decide when, where, and how to refine LLM reasoning, lifting a strong RL-finetuned Llama-2 13B by 12 percentage points on GSM8K (53% to 65%).
State-of-the-art language models can exhibit impressive reasoning refinement capabilities on math, science or coding tasks. However, recent work demonstrates that even the best models struggle to identify when and where to refine without access to external feedback. Outcome-based Reward Models (ORMs), trained to predict correctness of the final answer indicating when to refine, offer one convenient solution for deciding when to refine. Process Based Reward Models (PRMs), trained to predict correctness of intermediate steps, can then be used to indicate where to refine. But they are expensive to train, requiring extensive human annotations. In this paper, we propose Stepwise ORMs (SORMs) which are trained, only on synthetic data, to approximate the expected future reward of the optimal policy. More specifically, SORMs are trained to predict the correctness of the final answer when sampling the current policy many times (rather than only once as in the case of ORMs). Our experiments show that SORMs can more accurately detect incorrect reasoning steps compared to ORMs, thus improving downstream accuracy when doing refinements. We then train global refinement models, which take only the question and a draft solution as input and predict a corrected solution, and local refinement models which also take as input a critique indicating the location of the first reasoning error. We generate training data for both models synthetically by reusing data used to train the SORM. We find combining global and local refinements, using the ORM as a reranker, significantly outperforms either one individually, as well as a best of three sample baseline. With this strategy we can improve the accuracy of a LLaMA-2 13B model (already fine-tuned with RL) on GSM8K from 53% to 65% when greedily sampled.
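The abstract above says SORMs are trained to predict final-answer correctness when the current policy is sampled many times from a partial solution. Below is a minimal sketch of how such a rollout-based label could be computed. It makes two assumptions: `toy_policy` is a hypothetical stand-in for an LLM, and the "any rollout succeeds" rule is one plausible labeling choice rather than a confirmed detail of the paper.

```python
import random

def toy_policy(prefix, rng):
    """Hypothetical stand-in for an LLM completing a running-sum solution.

    The task is 1 + 2 + 3. Each remaining step is computed correctly with
    probability 0.7; otherwise the step overshoots by one.
    """
    terms = [1, 2, 3]
    total = prefix[-1] if prefix else 0
    for term in terms[len(prefix):]:
        total += term if rng.random() < 0.7 else term + 1
    return total

def rollout_label(prefix, answer, k=16, seed=0):
    """Label a prefix 1 if any of k sampled completions reaches the answer.

    Assumption: 'any success' is used as the target here; the abstract only
    states that the policy is sampled many times.
    """
    rng = random.Random(seed)
    return int(any(toy_policy(prefix, rng) == answer for _ in range(k)))

good_prefix = [1, 3]  # running sums after two correct steps
bad_prefix = [1, 4]   # the second step is already wrong
labels = (rollout_label(good_prefix, answer=6), rollout_label(bad_prefix, answer=6))
```

In this toy setup a wrong step can never be recovered from, so the bad prefix is labeled 0 regardless of how many rollouts are drawn. Locating the first incorrect step relies on exactly this signal.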
@inproceedings{havrilla2024glore,
title = {GLoRe: When, Where, and How to Improve LLM Reasoning via Global and Local Refinements},
author = {Alex Havrilla and Sharath Chandra Raparthy and Christoforos Nalmpantis and Jane Dwivedi-Yu and Maksym Zhuravinskyi and Eric Hambro and Roberta Raileanu},
booktitle = {International Conference on Machine Learning (ICML), 2024},
year = {2024},
eprint = {2402.10963},
archivePrefix = {arXiv}
}
A systematic comparison of expert iteration, PPO, and return-conditioned RL for reasoning — expert iteration proves surprisingly competitive.
Reinforcement Learning from Human Feedback (RLHF) has emerged as a dominant approach for aligning LLM outputs with human preferences. Inspired by the success of RLHF, we study the performance of multiple algorithms that learn from feedback (Expert Iteration, Proximal Policy Optimization (PPO), Return-Conditioned RL) on improving LLM reasoning capabilities. We investigate both sparse and dense rewards provided to the LLM both heuristically and via a learned reward model. We additionally start from multiple model sizes and initializations both with and without supervised fine-tuning (SFT) data. Overall, we find all algorithms perform comparably, with Expert Iteration performing best in most cases. Surprisingly, we find the sample complexity of Expert Iteration is similar to that of PPO, requiring at most on the order of 10^6 samples to converge from a pretrained checkpoint. We investigate why this is the case, concluding that during RL training models fail to explore significantly beyond solutions already produced by SFT models. Additionally, we discuss a trade off between maj@1 and pass@96 metric performance during SFT training and how conversely RL training improves both simultaneously. We then conclude by discussing the implications of our findings for RLHF and the future role of RL in LLM fine-tuning.
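The maj@k and pass@k metrics named in the abstract above can be sketched as follows. This is an illustrative definition over hypothetical sampled answers, and the paper's exact evaluation protocol may differ. With a single sample, maj@1 reduces to the accuracy of that one sample.

```python
from collections import Counter

def pass_at_k(samples, answer):
    """True if at least one of the k sampled answers is correct."""
    return any(s == answer for s in samples)

def maj_at_k(samples, answer):
    """True if the most common sampled answer is correct (ties: first seen wins)."""
    most_common, _ = Counter(samples).most_common(1)[0]
    return most_common == answer

# Hypothetical sampled final answers for two problems.
problems = [
    {"answer": "42", "samples": ["42", "41", "42", "40"]},
    {"answer": "7", "samples": ["8", "8", "7", "9"]},
]

pass_rate = sum(pass_at_k(p["samples"], p["answer"]) for p in problems) / len(problems)
maj_rate = sum(maj_at_k(p["samples"], p["answer"]) for p in problems) / len(problems)
print(pass_rate, maj_rate)  # 1.0 0.5
```

The second problem shows how the two metrics can diverge: the correct answer appears among the samples, so pass@k counts it, but the majority vote lands on a wrong answer.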
@misc{havrilla2024teaching,
title = {Teaching Large Language Models to Reason with Reinforcement Learning},
author = {Alex Havrilla and Yuqing Du and Sharath Chandra Raparthy and Christoforos Nalmpantis and Jane Dwivedi-Yu and Maksym Zhuravinskyi and Eric Hambro and Sainbayar Sukhbaatar and Roberta Raileanu},
year = {2024},
eprint = {2403.04642},
archivePrefix = {arXiv}
}
Transformers trained on diverse offline trajectories learn brand-new sequential decision-making tasks in-context from just a handful of demonstrations.
Training autonomous agents that can learn new tasks from only a handful of demonstrations is a long-standing problem in machine learning. Recently, transformers have been shown to learn new language or vision tasks without any weight updates from only a few examples, also referred to as in-context learning. However, the sequential decision making setting poses additional challenges having a lower tolerance for errors since the environment's stochasticity or the agent's actions can lead to unseen, and sometimes unrecoverable, states. In this paper, we use an illustrative example to show that naively applying transformers to sequential decision making problems does not enable in-context learning of new tasks. We then demonstrate how training on sequences of trajectories with certain distributional properties leads to in-context learning of new sequential decision making tasks. We investigate different design choices and find that larger model and dataset sizes, as well as more task diversity, environment stochasticity, and trajectory burstiness, all result in better in-context learning of new out-of-distribution tasks. By training on large diverse offline datasets, our model is able to learn new MiniHack and Procgen tasks without any weight updates from just a handful of demonstrations.
@inproceedings{raparthy2023generalization,
title = {Generalization to New Sequential Decision Making Tasks with In-Context Learning},
author = {Sharath Chandra Raparthy and Eric Hambro and Robert Kirk and Mikael Henaff and Roberta Raileanu},
booktitle = {International Conference on Machine Learning (ICML), 2024},
year = {2024},
eprint = {2312.03801},
archivePrefix = {arXiv}
}
Conditional GFlowNets that sample diverse Pareto-optimal candidates, outperforming prior methods for multi-objective drug and materials design.
We study the problem of generating diverse candidates in the context of Multi-Objective Optimization. In many applications of machine learning such as drug discovery and material design, the goal is to generate candidates which simultaneously optimize a set of potentially conflicting objectives. Moreover, these objectives are often imperfect evaluations of some underlying property of interest, making it important to generate diverse candidates to have multiple options for expensive downstream evaluations. We propose Multi-Objective GFlowNets (MOGFNs), a novel method for generating diverse Pareto optimal solutions, based on GFlowNets. We introduce two variants of MOGFNs: MOGFN-PC, which models a family of independent sub-problems defined by a scalarization function, with reward-conditional GFlowNets, and MOGFN-AL, which solves a sequence of sub-problems defined by an acquisition function in an active learning loop. Our experiments on a wide variety of synthetic and benchmark tasks demonstrate advantages of the proposed methods in terms of the Pareto performance and, importantly, improved candidate diversity, which is the main contribution of this work.
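Two ideas from the abstract above, Pareto optimality and scalarization by a preference vector, can be made concrete with a small sketch. The candidates and preference vectors are hypothetical, and the weighted sum is only one common scalarization choice. The sketch does not implement a GFlowNet.

```python
def dominates(a, b):
    """a dominates b if it is no worse on every objective and better on one (maximization)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(points):
    """Return the non-dominated subset of points."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

def weighted_sum(objectives, preference):
    """Scalarize a reward vector with a preference vector (one common choice)."""
    return sum(w * r for w, r in zip(preference, objectives))

# Hypothetical candidates scored on two objectives to maximize.
candidates = [(0.9, 0.1), (0.6, 0.6), (0.1, 0.9), (0.4, 0.4)]
front = pareto_front(candidates)

# Each preference vector defines one sub-problem with its own best candidate.
preferences = [(1.0, 0.0), (0.5, 0.5), (0.0, 1.0)]
best_for = {w: max(candidates, key=lambda c: weighted_sum(c, w)) for w in preferences}
```

Here each preference vector selects a different point on the front. That is the intuition behind conditioning a sampler on the preference so that one model covers a whole family of sub-problems.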
@inproceedings{jain2022multiobjective,
title = {Multi-Objective GFlowNets},
author = {Moksh Jain and Sharath Chandra Raparthy and Alex Hernandez-Garcia and Jarrid Rector-Brooks and Yoshua Bengio and Santiago Miret and Emmanuel Bengio},
booktitle = {International Conference on Machine Learning (ICML), 2023},
year = {2023},
eprint = {2210.12765},
archivePrefix = {arXiv}
}
Disentangles attention's search and retrieval steps and composes them dynamically — a drop-in replacement for multi-head attention.
Multi-head, key-value attention is the backbone of the widely successful Transformer model and its variants. This attention mechanism uses multiple parallel key-value attention blocks (called heads), each performing two fundamental computations: (1) search - selection of a relevant entity from a set via query-key interactions, and (2) retrieval - extraction of relevant features from the selected entity via a value matrix. Importantly, standard attention heads learn a rigid mapping between search and retrieval. In this work, we first highlight how this static nature of the pairing can potentially: (a) lead to learning of redundant parameters in certain tasks, and (b) hinder generalization. To alleviate this problem, we propose a novel attention mechanism, called Compositional Attention, that replaces the standard head structure. The proposed mechanism disentangles search and retrieval and composes them in a dynamic, flexible and context-dependent manner through an additional soft competition stage between the query-key combination and value pairing. Through a series of numerical experiments, we show that it outperforms standard multi-head attention on a variety of tasks, including some out-of-distribution settings. Through our qualitative analysis, we demonstrate that Compositional Attention leads to dynamic specialization based on the type of retrieval needed. Our proposed mechanism generalizes multi-head attention, allows independent scaling of search and retrieval, and can easily be implemented in lieu of standard attention heads in any network architecture.
@inproceedings{mittal2021compositional,
title = {Compositional Attention: Disentangling Search and Retrieval},
author = {Sarthak Mittal and Sharath Chandra Raparthy and Irina Rish and Yoshua Bengio and Guillaume Lajoie},
booktitle = {International Conference on Learning Representations (ICLR), 2022},
year = {2022},
eprint = {2110.09419},
archivePrefix = {arXiv}
}
Formalizes continual RL as scalable MDPs, proves they exhibit polynomial mixing times, and proposes three sample-efficient algorithms.
The mixing time of the Markov chain induced by a policy limits performance in real-world continual learning scenarios. Yet, the effect of mixing times on learning in continual reinforcement learning (RL) remains underexplored. In this paper, we characterize problems that are of long-term interest to the development of continual RL, which we call scalable MDPs, through the lens of mixing times. In particular, we theoretically establish that scalable MDPs have mixing times that scale polynomially with the size of the problem. We go on to demonstrate that polynomial mixing times present significant difficulties for existing approaches, which suffer from myopic bias and stale bootstrapped estimates. To validate our theory, we study the empirical scaling behavior of mixing times with respect to the number of tasks and task duration for high performing policies deployed across multiple Atari games. Our analysis demonstrates both that polynomial mixing times do emerge in practice and how their existence may lead to unstable learning behavior like catastrophic forgetting in continual learning settings.
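For readers unfamiliar with the quantity discussed in the abstract above, here is a minimal sketch that measures the mixing time of a small Markov chain as the first step at which every start state is within a total-variation threshold of the stationary distribution. The two-state chain and the threshold of 0.25 are illustrative assumptions and are not taken from the paper.

```python
def step(dist, P):
    """One step of the chain: row vector dist times transition matrix P."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]

def tv_distance(p, q):
    """Total variation distance between two distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def mixing_time(P, stationary, eps=0.25, max_steps=10_000):
    """Smallest t at which every start state is within eps (TV) of stationary."""
    n = len(P)
    dists = [[1.0 if i == s else 0.0 for i in range(n)] for s in range(n)]
    for t in range(1, max_steps + 1):
        dists = [step(d, P) for d in dists]
        if max(tv_distance(d, stationary) for d in dists) <= eps:
            return t
    return None  # did not mix within max_steps

def two_state_chain(a):
    """Symmetric chain that switches state with probability a per step."""
    return [[1 - a, a], [a, 1 - a]]

fast = mixing_time(two_state_chain(0.3), [0.5, 0.5])
slow = mixing_time(two_state_chain(0.01), [0.5, 0.5])
```

A chain that rarely switches state takes many more steps to forget where it started. A policy deployed across many long-lived tasks is in a similar position.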
@inproceedings{riemer2021continual,
title = {Continual Learning In Environments With Polynomial Mixing Times},
author = {Matthew Riemer and Sharath Chandra Raparthy and Ignacio Cases and Gopeshh Subbaraj and Maximilian Puelma Touzel and Irina Rish},
booktitle = {Neural Information Processing Systems (NeurIPS), 2022},
year = {2022},
eprint = {2112.07066},
archivePrefix = {arXiv}
}
Shows MAML is highly sensitive to task distributions; learning a task curriculum instead of uniform sampling substantially improves adaptation.
Gradient-based meta-learners such as Model-Agnostic Meta-Learning (MAML) have shown strong few-shot performance in supervised and reinforcement learning settings. However, specifically in the case of meta-reinforcement learning (meta-RL), we can show that gradient-based meta-learners are sensitive to task distributions. With the wrong curriculum, agents suffer the effects of meta-overfitting, shallow adaptation, and adaptation instability. In this work, we begin by highlighting intriguing failure cases of gradient-based meta-RL and show that task distributions can wildly affect algorithmic outputs, stability, and performance. To address this problem, we leverage insights from recent literature on domain randomization and propose meta Active Domain Randomization (meta-ADR), which learns a curriculum of tasks for gradient-based meta-RL in a similar way as ADR does for sim2real transfer. We show that this approach induces more stable policies on a variety of simulated locomotion and navigation tasks. We assess in- and out-of-distribution generalization and find that the learned task distributions, even in an unstructured task space, greatly improve the adaptation performance of MAML. Finally, we motivate the need for better benchmarking in meta-RL that prioritizes generalization over single-task adaptation performance.
@inproceedings{mehta2020curriculum,
title = {Curriculum in Gradient-Based Meta-Reinforcement Learning},
author = {Bhairav Mehta and Tristan Deleu and Sharath Chandra Raparthy and Christopher Pal and Liam Paull},
booktitle = {ICLR BeTR-RL Workshop, 2020},
year = {2020},
eprint = {2002.07956},
archivePrefix = {arXiv}
}
Adds curiosity-driven exploration to neural-augmented simulators, improving sim-to-real transfer of robot policies.
@inproceedings{raparthy2021cunas,
title = {CuNAS — CUriosity-driven Neural-Augmented Simulator},
author = {Sharath Chandra Raparthy and Melissa Mozifian and Liam Paull and Florian Golemo},
booktitle = {RSS Sim2Real Workshop, 2021},
year = {2021},
url = {https://docs.google.com/presentation/d/1nVbt0iQKFTOgHEQLLHbn1Wy3bMs_mWpyzfc0aZsN30U/edit?usp=sharing}
}