A Theory for Emergence of Complex Skills in Language Models

Citation

Authors: Sanjeev Arora, Anirudh Goyal
Year: 2023
Venue: Preprint (arXiv)
URL: http://arxiv.org/abs/2307.15936

Abstract

A driver of current AI research is the fact that new skills emerge in language models when their parameter set and training corpora are scaled up. This phenomenon is poorly understood, and a mechanistic explanation via mathematical analysis of gradient-based training seems difficult. The current paper takes a different approach, analysing emergence using the famous (and empirical) Scaling Laws of LLMs and a simple statistical framework.

Summary

This paper provides a theoretical framework for understanding how complex skills emerge in LLMs through scaling, using statistical methods rather than mechanistic gradient analysis.

Key Contributions

A statistical framework relating cross-entropy loss to competence on basic skills
Mathematical analysis showing that the Scaling Laws imply a strong form of inductive bias (“slingshot generalization”)
Proof that competence at “complex skills” (involving $k$-tuples of basic skills) emerges at essentially the same scaling as competence at elementary skills

Core Concepts & Definitions

Skill Graph

A bipartite graph $G = (V_{1}, V_{2}, E)$ where:

$V_{1}$ = set of skills $S$
$V_{2}$ = set of text-pieces $T$
Edge $(s, t) \in E$ means comprehending text-piece $t$ requires skill $s$
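The definition above can be sketched as a small data structure. This is a minimal illustration, not the paper's code: the skill names, text-piece ids, and the helper `build_skill_graph` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical toy edges (skill, text-piece); names are invented for illustration.
edges = [
    ("arithmetic", "t1"),
    ("arithmetic", "t2"),
    ("metaphor", "t2"),
    ("metaphor", "t3"),
]


def build_skill_graph(edges):
    """Return both adjacency maps of the bipartite graph G = (V1, V2, E)."""
    texts_for_skill = defaultdict(set)  # skill s -> text-pieces adjacent to s
    skills_for_text = defaultdict(set)  # text-piece t -> skills t requires
    for skill, text in edges:
        texts_for_skill[skill].add(text)
        skills_for_text[text].add(skill)
    return dict(texts_for_skill), dict(skills_for_text)


texts_for_skill, skills_for_text = build_skill_graph(edges)
print(sorted(skills_for_text["t2"]))  # skills needed to comprehend t2
```

Keeping both directions of the adjacency makes it cheap to ask either "which text-pieces exercise skill $s$?" or "which skills does text-piece $t$ require?".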

Competence

For a skill $s \in S$, $\mathrm{Competence}(s) \in [0, 1]$ is the success rate on cloze questions from randomly selected text-pieces adjacent to $s$.
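As a rough sketch of this definition, competence can be computed as the average cloze success rate over the text-pieces adjacent to a skill. The adjacency map, the per-text-piece success rates, and the function name `competence` are hypothetical, and the sketch assumes text-pieces are sampled uniformly.

```python
def competence(skill, texts_for_skill, cloze_success):
    """Mean cloze success rate over text-pieces adjacent to `skill`.

    Assumes a uniform choice among the adjacent text-pieces.
    """
    adjacent = texts_for_skill[skill]
    return sum(cloze_success[t] for t in adjacent) / len(adjacent)


# Hypothetical toy data: skill -> adjacent text-pieces, and the model's
# cloze success rate on each text-piece.
texts_for_skill = {"arithmetic": ["t1", "t2"], "metaphor": ["t2", "t3"]}
cloze_success = {"t1": 1.0, "t2": 0.5, "t3": 0.0}

for skill in texts_for_skill:
    print(skill, competence(skill, texts_for_skill, cloze_success))
```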

Slingshot Generalization

The strong inductive bias implied by the Scaling Laws: pre-trained models learn very efficiently, to a degree that appears to violate usual generalization theory.

Main Results

Theorem 14 (Emergence of $k'$-tuples): With scaling, competence on $k'$-tuples of skills improves almost as fast as competence on individual skills.
Corollary 13: When the model is scaled so that its loss drops from $\delta$ to $\delta / k'$, its performance on $k'$-tuples equals its previous performance on individual skills.
Key Insight: 10× scaling ≈ 2× increase in the number of skills that can be composed.
Poverty of Stimulus: If the model displays competence on 10% of $k'$-tuples, it must have acquired competence in combinations not seen during training.
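The arithmetic behind the last two items can be checked with a short sketch. It rests on assumptions that are not in this note: the reducible loss is taken to follow a Chinchilla-style power law with exponents of roughly 0.34 (parameters) and 0.28 (data), and the skill count, $k'$, and corpus size are hypothetical. The counting step also makes the simplifying assumption that each training text-piece exhibits at most one $k'$-tuple.

```python
from math import comb

# Assumed power-law exponents for the reducible loss (not stated in this note).
ALPHA_PARAMS = 0.34  # exponent on parameter count
BETA_DATA = 0.28     # exponent on dataset size


def shrink_factor(scale, exponent):
    """Factor by which a term B / x**exponent shrinks when x grows by `scale`."""
    return scale ** exponent


param_factor = shrink_factor(10, ALPHA_PARAMS)
data_factor = shrink_factor(10, BETA_DATA)
print(f"10x parameters shrinks its loss term by about {param_factor:.2f}x")
print(f"10x data shrinks its loss term by about {data_factor:.2f}x")

# Poverty-of-stimulus counting with hypothetical sizes.
NUM_SKILLS = 1000                  # hypothetical |S|
K_PRIME = 4                        # hypothetical tuple size k'
NUM_TRAINING_TEXT_PIECES = 10**9   # hypothetical corpus size

num_tuples = comb(NUM_SKILLS, K_PRIME)
competent_tuples = num_tuples // 10  # competence on 10% of all k'-tuples
# Simplification: each text-piece covers at most one k'-tuple.
min_unseen = max(0, competent_tuples - NUM_TRAINING_TEXT_PIECES)
print(f"{num_tuples:,} tuples, at least {min_unseen:,} competent but unseen")
```

Both factors land near 2, so under these assumptions a 10× scale-up roughly halves the reducible loss. By Corollary 13 with $k' = 2$, that moves performance on pairs of skills to where performance on single skills used to be.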

Relevance to Project

Critical. This paper provides the theoretical foundation for our skills-algebra framework:

Supports treating skills as composable units
Explains why skill composition emerges with scale
The “skill graph” concept relates to our ontology work
Their deliberate avoidance of formalizing skill composition is what our project aims to address

Questions & Notes

How does their statistical framework connect to our algebraic approach?
Can we use their emergence thresholds to constrain our ontological expansion?
Their conservative accounting “obviates the need for a mathematical formulation of what skills are”; we’re taking the opposite approach
