Bibliographic record
Retaining by Doing: The Role of On-Policy Data in Mitigating Forgetting
- Authors
- Howard Chen, Noam Razin, Karthik Narasimhan, Danqi Chen
- Publication year
- 2025
- OA status
- oa_green
Print
Need access?
Ask circulation staff for physical copies or request digital delivery via Ask a Librarian.
Digital copy
Unavailable in your region (PD status unclear).
Abstract
Adapting language models (LMs) to new tasks via post-training carries the
risk of degrading existing capabilities -- a phenomenon classically known as
catastrophic forgetting. In this paper, toward identifying guidelines for
mitigating this phenomenon, we systematically compare the forgetting patterns
of two widely adopted post-training methods: supervised fine-tuning (SFT) and
reinforcement learning (RL). Our experiments reveal a consistent trend across
LM families (Llama, Qwen) and tasks (instruction following, general knowledge,
and arithmetic reasoning): RL leads to less forgetting than SFT while achieving
comparable or higher target task performance. To investigate the cause for this
difference, we consider a simplified setting in which the LM is modeled as a
mixture of two distributions, one corresponding to prior knowledge and the
other to the target task. We identify that the mode-seeking nature of RL, which
stems from its use of on-policy data, enables keeping prior knowledge intact
when learning the target task. We then verify this insight by demonstrating
that the use of on-policy data underlies the robustness of RL to forgetting in
practical settings, as opposed to other algorithmic choices such as the KL
regularization or advantage estimation. Lastly, as a practical implication, our
results highlight the potential of mitigating forgetting using approximately
on-policy data, which can be substantially more efficient to obtain than fully
on-policy data.
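The abstract's argument turns on the contrast between mode-covering behaviour (forward KL, the objective that supervised fine-tuning on target data effectively minimizes) and the mode-seeking behaviour it attributes to on-policy RL. The toy sketch below is not from the paper: it fits a single Gaussian to a two-mode mixture under each objective, and the grid ranges, mixture components, and Gaussian family are arbitrary illustrative choices.

```python
# Illustrative sketch (not from the paper): contrast mode-covering forward KL,
# KL(p || q), with mode-seeking reverse KL, KL(q || p), when a single Gaussian q
# is fit to a bimodal target p. The mixture weights, grid, and Gaussian family
# are arbitrary choices made only for this illustration.
import numpy as np

x = np.linspace(-10, 10, 4001)
dx = x[1] - x[0]

def gaussian(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# A two-mode mixture, loosely echoing the abstract's "mixture of two
# distributions" (prior knowledge and target task).
p = 0.5 * gaussian(x, -3.0, 1.0) + 0.5 * gaussian(x, 3.0, 1.0)

def kl(a, b):
    # Discretized KL divergence between densities a and b on the grid.
    eps = 1e-12
    return np.sum(a * (np.log(a + eps) - np.log(b + eps))) * dx

# Brute-force grid search over single-Gaussian fits q(mu, sigma).
mus = np.linspace(-5, 5, 101)
sigmas = np.linspace(0.5, 5, 46)

best_fwd, best_rev = None, None
for mu in mus:
    for sigma in sigmas:
        q = gaussian(x, mu, sigma)
        fwd = kl(p, q)   # forward KL: penalizes q for missing any mass of p
        rev = kl(q, p)   # reverse KL: penalizes q for placing mass where p has none
        if best_fwd is None or fwd < best_fwd[0]:
            best_fwd = (fwd, mu, sigma)
        if best_rev is None or rev < best_rev[0]:
            best_rev = (rev, mu, sigma)

print("forward-KL fit (covers both modes):  mu=%.2f sigma=%.2f" % best_fwd[1:])
print("reverse-KL fit (locks onto one mode): mu=%.2f sigma=%.2f" % best_rev[1:])
```

On this toy problem the forward-KL fit spreads its mass across both modes (mean near 0, large variance), while the reverse-KL fit concentrates on a single mode; the latter is the mode-seeking behaviour the abstract credits with leaving the rest of the model's distribution, i.e. prior knowledge, intact.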
Copies & availability
Real-time status across the circulation, reserve, and Filipiniana sections.
Self-checkout (no login required)
- Enter your student ID, system ID, or full name directly in the table so we can match your patron record.
- Choose Self-checkout to send the request; circulation staff are notified instantly.
| Barcode | Location | Material type | Status | Action |
|---|---|---|---|---|
| No holdings recorded. | | | | |
Digital files
Preview digitized copies when embargo permits.
- View digital file (original) · application/pdf · 965 KB
Links & eResources
Access licensed or open resources connected to this record.
- Open access: direct link