RECKONING Outperforms FT-ICR in Knowledge Disentanglement: Robust Fact Recovery with Distractors

28 Oct 2025

Table of Links

Abstract and 1. Introduction

A. Dataset

B. In-context Reasoning with Distractors

C. Implementation Details

D. Adaptive Learning Rate

E. Experiments with Large Language Models

4.5 Memorizing Knowledge

Table 5: Exact match score for reproducing memorized knowledge. In contrast to in-context reasoning, RECKONING does not have direct access to the knowledge.

The results show that RECKONINGMT can successfully (average exact match score of 99.3%) recover the relevant facts from its model parameters when the context does not include any distractors. Note that this is comparable to the FT-ICR baseline, for which the task is much easier as it can directly attend to and copy the facts from input, while RECKONINGMT no longer has direct access to them. When the context includes distractors, both RECKONING and FT-ICR struggle to identify and reproduce only the relevant facts. However, the performance for FT-ICR (average 49.4%) drops far below that of RECKONING (73.6%), demonstrating that RECKONING is much better at disentangling the relevant knowledge from the distractors.

Finally, we show that RECKONING with a multi-task objective is also more robust to distractors as it trains the model to reproduce only the facts that would be relevant to a particular question we ask in the outer loop. As in Section 4.2, we use the ProofWriter dataset and, for each question, add all the distractors to the context. We train the model using the multi-task objective and report the label accuracy. While in Table 1, we originally saw a ∼ 1% improvement from training with a multi-task objective on ProofWriter with no distractors, we see a much more significant performance gap in Figure 6 (∼ 18.2%) when distractors are available. We also note that the performance of the single-task model is essentially random (see the Random-Facts baseline from Table 1). By learning how to memorize knowledge in the inner loop to recall relevant facts in the outer loop, the model also learns how to encode facts more robustly over them.

Figure 6: RECKONING performance when trained with a single-task and a multi-task objective under distractors. When trained with the multi-task objective, the model learns to memorize and reason over the relevant facts.

Authors:

(1) Zeming Chen, EPFL ([email protected]);

(2) Gail Weiss, EPFL ([email protected]);

(3) Eric Mitchell, Stanford University ([email protected])';

(4) Asli Celikyilmaz, Meta AI Research ([email protected]);

(5) Antoine Bosselut, EPFL ([email protected]).

This paper is available on arxiv under CC BY 4.0 DEED license.

[3] We perform this experiment in a limited setting and do not handle the case where hidden states could be cached for the forward pass of in-context reasoning, likely speeding up multi-question inference [12].

[4] This is done by prompting the model with the question and comparing its output (after its answer to the question) to the concatenation of all facts. The model is able to produce these facts in the expected order due to an implementation detail: they are numbered and labeled when given to the inner loop.

← Previous

The Efficiency of Dynamic Knowledge Encoding: RECKONING's Advantage in Multi-Question Scenarios

Up Next →

Connecting Reasoning, Model Editing, and LLM Knowledge: RECKONING's Contribution to AI Research