A Theory for Emergence of Complex Skills in Language Models

Citation

  • Authors: Sanjeev Arora, Anirudh Goyal
  • Year: 2023
  • Venue: Preprint (arXiv)
  • URL: http://arxiv.org/abs/2307.15936

Abstract

A driver of current AI research is the fact that new skills emerge in language models when their parameter set and training corpora are scaled up. This phenomenon is poorly understood, and a mechanistic explanation via mathematical analysis of gradient-based training seems difficult. The current paper takes a different approach, analysing emergence using the famous (and empirical) Scaling Laws of LLMs and a simple statistical framework.

Summary

This paper provides a theoretical framework for understanding how complex skills emerge in LLMs through scaling, using statistical methods rather than mechanistic gradient analysis.

Key Contributions

  1. A statistical framework relating cross-entropy loss to competence on basic skills
  2. Mathematical analysis showing that the Scaling Laws imply a strong form of inductive bias (“slingshot generalization”)
  3. Proof that competence at “complex skills” (those involving $k$-tuples of basic skills) emerges at essentially the same scaling and rate as competence on the elementary skills themselves

Core Concepts & Definitions

Skill Graph

A bipartite graph $(S, T, E)$ where:

  • $S$ = set of skills
  • $T$ = set of text-pieces
  • Edge $(s, t) \in E$ means comprehending text-piece $t$ requires skill $s$

Competence

For skill $s$: the model's success rate on cloze questions from a randomly selected text-piece adjacent to $s$ in the skill graph.
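
The skill graph and competence definitions above can be made concrete with a small sketch. Everything in it is illustrative: the skill names, the text-piece ids, the per-text-piece cloze success rates, and the helper names `build_skill_graph` and `competence` are hypothetical, and text-pieces are sampled uniformly, which is a simplifying assumption rather than something these notes specify.

```python
from collections import defaultdict

# Hypothetical toy data: each text-piece t maps to the skills s it requires,
# i.e. the edges (s, t) of the bipartite skill graph.
text_skills = {
    "t1": {"negation", "arithmetic"},
    "t2": {"negation"},
    "t3": {"arithmetic", "metaphor"},
}

# Hypothetical fraction of cloze questions answered correctly in each text-piece.
cloze_success = {"t1": 0.5, "t2": 1.0, "t3": 0.75}


def build_skill_graph(text_skills):
    """Map each skill to the set of text-pieces adjacent to it."""
    adjacent = defaultdict(set)
    for text_piece, skills in text_skills.items():
        for skill in skills:
            adjacent[skill].add(text_piece)
    return dict(adjacent)


def competence(skill, adjacent, cloze_success):
    """Mean cloze success over text-pieces adjacent to `skill`.

    Assumes a uniformly random adjacent text-piece (a simplification).
    """
    pieces = adjacent[skill]
    return sum(cloze_success[t] for t in pieces) / len(pieces)


graph = build_skill_graph(text_skills)
negation_competence = competence("negation", graph, cloze_success)
arithmetic_competence = competence("arithmetic", graph, cloze_success)
```

Storing the graph as a skill-to-text-pieces map keeps the competence lookup to a single pass over the neighbours of one skill, which is the only query this definition needs.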

Slingshot Generalization

The phenomenon where pre-trained models learn very efficiently, appearing to violate usual generalization theory.

Main Results

  1. Theorem 14 (Emergence of $k$-tuples): Competence on $k$-tuples of skills improves almost as fast with scaling as competence on individual skills.

  2. Corollary 13: When the model is scaled up so that the loss drops from $\ell$ to $\ell/k$, performance on $k$-tuples matches the earlier model's performance on individual skills.

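  3. Key Insight: Under the Scaling Laws, a 10× scale-up cuts the excess loss by roughly a factor of 2, which corresponds to roughly a 2× increase in the number of skills that can be composed.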

  4. Poverty of Stimulus: If the model displays competence on even 10% of $k$-tuples, it must have acquired competence on combinations it never saw during training, since the number of $k$-tuples far exceeds what the training corpus can cover.
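
The arithmetic behind the last two results can be checked with a short sketch. It assumes Chinchilla-style reducible loss terms of the form B / N^0.34 and C / D^0.28; those exponents are brought in for illustration and are not stated in these notes. The skill count of 1000 and k = 4 are hypothetical, and $k$-tuples are counted as unordered sets, which is also an assumption.

```python
import math

# Assumed Chinchilla-style exponents for the reducible loss terms
# B / N**PARAM_EXPONENT and C / D**DATA_EXPONENT (not stated in these notes).
PARAM_EXPONENT = 0.34
DATA_EXPONENT = 0.28


def reduction_factor(scale, exponent):
    """Factor by which a term 1 / x**exponent shrinks when x grows by `scale`."""
    return scale ** exponent


def count_skill_tuples(num_skills, k):
    """Number of unordered k-tuples of distinct skills."""
    return math.comb(num_skills, k)


# A 10x scale-up shrinks each reducible term by roughly a factor of 2.
param_factor = reduction_factor(10, PARAM_EXPONENT)  # a bit above 2
data_factor = reduction_factor(10, DATA_EXPONENT)    # a bit below 2

# Hypothetical sizes: 1000 basic skills composed 4 at a time.
num_tuples = count_skill_tuples(1000, 4)
ten_percent_of_tuples = num_tuples // 10

print(f"parameter term shrinks by {param_factor:.2f}x")
print(f"data term shrinks by {data_factor:.2f}x")
print(f"4-tuples of 1000 skills: {num_tuples:,}")
print(f"10% of those tuples: {ten_percent_of_tuples:,}")
```

Comparing `ten_percent_of_tuples` with the number of text-pieces in a corpus gives a feel for the poverty-of-stimulus argument: the tuple count grows combinatorially in k, while the corpus size is fixed.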

Relevance to Project

Critical: this paper provides the theoretical foundation for our skills-algebra framework:

  • Supports treating skills as composable units
  • Explains why skill composition emerges with scale
  • The “skill graph” concept relates to our ontology work
  • Their deliberate avoidance of formalizing skill composition is what our project aims to address

Questions & Notes

  • How does their statistical framework connect to our algebraic approach?
  • Can we use their emergence thresholds to constrain our ontological expansion?
  • Their conservative accounting “obviates the need for a mathematical formulation of what skills are”; we are taking the opposite approach