The Quantization Model of Neural Scaling

Citation

Authors: Eric J. Michaud, Ziming Liu, Uzay Girit, Max Tegmark
Year: 2023
Venue: NeurIPS 2023 (24 pages, 18 figures)
URL: http://arxiv.org/abs/2303.13506

Abstract

We propose the Quantization Model of neural scaling laws, explaining both the observed power law dropoff of loss with model and data size, and also the sudden emergence of new capabilities with scale. We derive this model from what we call the Quantization Hypothesis, where network knowledge and skills are “quantized” into discrete chunks (quanta).

Summary

This paper provides a mechanistic explanation for both smooth scaling laws AND sudden capability emergence by proposing that skills are discrete “quanta” learned in order of decreasing use frequency.

Key Contributions

  1. Quantization Hypothesis: skills are discrete chunks learned by frequency
  2. Explains power law scaling from Zipfian frequency distribution
  3. Distinguishes monogenic (sharp transition) vs polygenic (gradual) samples
  4. Validates on toy datasets and analyzes LLM scaling curves

Core Concepts & Definitions

Quantization Hypothesis

Network knowledge and skills are quantized into discrete modules (quanta). Models learn these quanta in order of decreasing “use frequency” in the training distribution.

Quantum (pl. Quanta)

A discrete unit of knowledge/skill. Analogous to Minsky’s “Society of Mind” agents.

Monogenic Sample

A prediction problem whose performance is determined by a single quantum; exhibits a sharp phase transition when that quantum is learned.

Polygenic Sample

A prediction problem whose performance depends on multiple quanta; exhibits gradual improvement with scale as each contributing quantum is learned.
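The monogenic/polygenic distinction can be illustrated with a toy sketch (my own, not the paper's code): assume quanta are learned in frequency order, so a model at "scale" n has learned quanta 1..n, and a sample's loss comes from the quanta it needs that are still missing. A monogenic sample then shows a step-function loss curve, a polygenic one a staircase that looks gradual in aggregate.

```python
# Toy illustration (hypothetical, not from the paper): a model at scale n has
# learned quanta 1..n; a sample's loss is the fraction of its required quanta
# that the model has not yet learned.

def sample_loss(required_quanta, n_learned):
    """Fraction of this sample's required quanta still missing at scale n."""
    missing = [k for k in required_quanta if k > n_learned]
    return len(missing) / len(required_quanta)

scales = range(11)
mono = [sample_loss([5], n) for n in scales]        # monogenic: one quantum (k=5)
poly = [sample_loss([2, 5, 8], n) for n in scales]  # polygenic: three quanta

print(mono)  # sharp step: 1.0 until n reaches 5, then 0.0
print(poly)  # gradual staircase as each of the three quanta is learned
```

The quantum indices (5 and {2, 5, 8}) are arbitrary choices for the demo; the point is only the shape of the two curves.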

Q-Sequence

The ordering of quanta by use frequency, determining learning priority.

Main Results

If quanta use frequencies follow a power law (Zipfian), p_k ∝ k^(-(α+1)), and a model that has learned the n most frequent quanta incurs loss only on the remaining ones, then L(n) ∝ n^(-α). This yields:

  • Parameter Scaling: with quanta requiring roughly equal capacity, n ∝ N, so L(N) ∝ N^(-α)
  • Data Scaling (multi-epoch): a quantum is learned once the dataset contains enough examples of it (D·p_k above a threshold), giving L(D) ∝ D^(-α/(α+1))
  • Data Scaling (single-epoch): the same threshold argument applied to the samples seen during the single pass gives the same exponent, L(D) ∝ D^(-α/(α+1))
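The parameter-scaling claim is easy to check numerically. The sketch below (my own, assuming only the Zipfian setup above) draws quanta frequencies p_k ∝ k^(-(α+1)), computes the residual loss L(n) = Σ_{k>n} p_k after learning the n most frequent quanta, and fits the log-log slope, which should come out near -α.

```python
import numpy as np

# Numerical check of the Zipf -> power-law-loss argument (my sketch, not the
# authors' code): residual loss after learning the n most frequent quanta.
alpha = 0.5
K = 1_000_000                      # size of the quanta "universe"
k = np.arange(1, K + 1)
p = k ** -(alpha + 1.0)
p /= p.sum()                       # normalize into a use-frequency distribution

# tail[n] = sum_{k > n} p_k  (loss contributed by the not-yet-learned quanta)
tail = p[::-1].cumsum()[::-1]
ns = np.array([10, 100, 1_000, 10_000])
losses = tail[ns]

# Log-log slope of L(n) vs n should be close to -alpha.
slope = np.polyfit(np.log(ns), np.log(losses), 1)[0]
print(f"fitted exponent: {slope:.3f} (predicted: {-alpha})")
```

Finite-universe truncation (K quanta instead of infinitely many) bends the curve slightly, so the fitted slope matches -α only approximately; growing K tightens the agreement.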

Relevance to Project

High. Provides a mechanism underlying emergence:

  • “Quanta” concept maps to our primitive skills
  • Zipfian frequency relates to our complexity filtration
  • Monogenic/polygenic distinction relevant for task-relative ontology
  • Could inform how we model skill acquisition order

Questions & Notes

  • How do quanta relate to our algebraically composed skills?
  • Does the Q-sequence correspond to our complexity ordering?
  • Can we use their framework to predict which skill compositions will emerge?