Implicit capabilities of language models
Authors
Abstract
We study implicit capabilities of language models: abilities that emerge from standard training without the models being directly trained to possess them. As AI systems become more powerful, understanding such capabilities matters for anticipating these systems' behaviour and designing safety interventions. We present two such capabilities, both demonstrated using similar multi-stage fine-tuning setups with alias-entity datasets.
First, we show that language models can learn indicators of source reliability and subsequently internalize new, apparently reliable information to a greater extent. We fine-tune models on tagged statements in which “tags” (initially random strings) correlate with usefulness for predicting other text. This leads to implicit meta-learning: in subsequent training, the model internalizes new statements more strongly when they carry tags it has learned to associate with usefulness, so it is more likely to answer related questions as if these apparently reliable statements were true. This selective internalization happens even though the loss treats all statements identically. We observe this capability in language and vision models across architectures, including models trained from scratch. This work also introduced the notion of out-of-context learning, which has informed subsequent research on how and what models can learn in principle.
Second, we demonstrate that language models can track when in training they learned specific facts: training-order information is linearly encoded in their activations. We sequentially fine-tune models on disjoint datasets to create a known training order, then measure average activations for held-out samples from each stage. These averages lie along a “recency” direction and are arranged exactly in the order of training. Linear probes distinguish early from late training stages with about 90% accuracy, and models can also be fine-tuned to report an entity's training stage. This capability has implications for models' ability to detect and resist modifications.
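The probing analysis described above can be illustrated with a minimal sketch on synthetic data. Everything here is an assumption made for illustration: the activations are random vectors shifted along a made-up direction, the function names are hypothetical, and the probe is a simple difference-of-means classifier swapped in for whatever probe the thesis actually trains. It is not the author's code and does not reproduce the reported results.

```python
import numpy as np


def make_synthetic_activations(n_stages=4, n_per_stage=200, dim=32, shift=1.0, seed=0):
    """Hypothetical stand-in for model activations: stage k is Gaussian noise
    shifted by k * shift along one fixed random unit direction."""
    rng = np.random.default_rng(seed)
    direction = rng.normal(size=dim)
    direction /= np.linalg.norm(direction)
    return [rng.normal(size=(n_per_stage, dim)) + k * shift * direction
            for k in range(n_stages)]


def recency_direction(acts):
    """Unit vector from the first-stage mean to the last-stage mean."""
    d = acts[-1].mean(axis=0) - acts[0].mean(axis=0)
    return d / np.linalg.norm(d)


def stage_projections(acts, direction):
    """Project each stage's mean activation onto the given direction."""
    return np.array([a.mean(axis=0) @ direction for a in acts])


def probe_accuracy(train_early, train_late, test_early, test_late):
    """Difference-of-means linear probe, thresholded at the class midpoint."""
    mu_early = train_early.mean(axis=0)
    mu_late = train_late.mean(axis=0)
    w = mu_late - mu_early
    threshold = 0.5 * (mu_early + mu_late) @ w
    pred_late_on_early = test_early @ w > threshold  # True means "late"
    pred_late_on_late = test_late @ w > threshold
    correct = (~pred_late_on_early).sum() + pred_late_on_late.sum()
    return correct / (len(test_early) + len(test_late))


acts = make_synthetic_activations()
train = [a[:100] for a in acts]   # used to fit the direction and the probe
test = [a[100:] for a in acts]    # held-out samples from each stage

direction = recency_direction(train)
projections = stage_projections(test, direction)
accuracy = probe_accuracy(train[0], train[-1], test[0], test[-1])

print("held-out stage means along the direction:", projections)
print("early-vs-late probe accuracy:", accuracy)
```

In this toy setting the held-out stage means line up along the fitted direction in training order by construction; with real models, the activations would instead be extracted from a chosen layer after sequential fine-tuning.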
Together, these results show that language models can track and use nontrivial metadata about their training. We argue that understanding such implicit capabilities is important for assessing risks and governing more powerful systems.
Advisors
Turner, Richard

