Implicit capabilities of language models

Krasheninnikov, Dmitrii

doi:10.17863/CAM.124900

Repository URI

https://www.repository.cam.ac.uk/handle/1810/395362

Repository DOI

https://doi.org/10.17863/CAM.124900

Files

Primary Thesis (3.17 MB)

Type

Thesis

Authors

Krasheninnikov, Dmitrii

Abstract

We study implicit capabilities of language models: abilities that emerge from standard training without the models being directly trained to possess them. As AI systems become more powerful, understanding such capabilities matters for anticipating these systems' behaviour and designing safety interventions. We present two such capabilities, both demonstrated using similar multi-stage fine-tuning setups with alias-entity datasets.

First, we show that language models can learn indicators of source reliability and subsequently internalize new apparently-reliable information to a greater extent. We first fine-tune models on tagged statements in which “tags” -- initially-random strings -- correlate with usefulness for predicting other text. This leads to implicit meta-learning: in subsequent training, the model internalizes new statements more strongly when they carry tags it has learned to associate with usefulness, so it is more likely to answer related questions as if these apparently-reliable statements were true. This selective internalization happens even though the loss treats all statements identically. We observe this capability in language and vision models across architectures, including models trained from scratch. This work also introduced the notion of out-of-context learning, which has informed subsequent research on how and what models can learn in principle.

Second, we demonstrate that language models can track at which point of training they learned specific facts: training-order information is linearly encoded in their activations. We sequentially fine-tune models on disjoint datasets to create a known training order, then measure average activations for held-out samples from each stage. These averages lie along a “recency” direction and are arranged exactly in the order of training. Linear probes distinguish early from late training stages with about 90% accuracy, and models can also be fine-tuned to report an entity's training stage. This capability has implications for models' ability to detect and resist modifications.

Together, these results show that language models can track and use nontrivial metadata about their training. We argue that understanding such implicit capabilities is important for assessing risks and governing more powerful systems.

Date

2025-11-18

Advisors

Krueger, David Scott
Turner, Richard

Keywords

Machine learning, Neural networks, Interpretability, Deep learning, Out-of-context learning, In-context learning

Qualification

Doctor of Philosophy (PhD)

Awarding Institution

University of Cambridge

Rights and licensing

Except where otherwise noted, this item's license is described as Attribution 4.0 International (CC BY 4.0)

Collections

Theses - Engineering
