Transformers Can Learn Meta-skills for Task Generalization in In-Context Learning
Citation
Authors: Yao Fan et al.
Year: 2024
Venue:
URL:
Abstract
Can Transformers learn “meta-skills” that enable composing basic skills to generalize to unseen task combinations? Prior work showed that Transformers trained only on basic skills fail on their compositions. This paper demonstrates the conditions under which meta-skill learning succeeds.
Summary
Trains Transformers on in-context learning of function classes and their compositions, demonstrating generalization to held-out compositions through learned “meta-skills.”
Key Contributions
- Definition of meta-skills as compositional operators
- Proof that training on partial compositions enables generalization
- Weak-to-strong generalization (2-compositions → 3-5 compositions)
- Importance of orthogonal basis functions for meta-skill learning
Core Concepts & Definitions
Basic Skill
The ability to perform in-context learning (ICL) on a function class (e.g., linear, quadratic, sine, sqrt, heaviside).
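The basic-skill setup can be sketched as sampling a function from a class and building an in-context prompt of (x, y) pairs plus a query point. This is a minimal illustration, not the paper's actual code; the function parameterizations and the helper name `sample_icl_prompt` are assumptions.

```python
import numpy as np

def sample_icl_prompt(fn_class, n_context=16, rng=None):
    """Sample a function from a basic class (illustrative parameterizations)
    and build an ICL prompt: context pairs (x_i, y_i) plus a query x."""
    rng = rng or np.random.default_rng(0)
    if fn_class == "linear":        # f(x) = a*x + b
        a, b = rng.normal(size=2)
        f = lambda x: a * x + b
    elif fn_class == "quadratic":   # f(x) = a*x^2
        a = rng.normal()
        f = lambda x: a * x ** 2
    elif fn_class == "sine":        # f(x) = sin(a*x)
        a = rng.normal()
        f = lambda x: np.sin(a * x)
    else:
        raise ValueError(f"unknown function class: {fn_class}")
    xs = rng.uniform(-1.0, 1.0, size=n_context + 1)
    ys = f(xs)
    context = list(zip(xs[:-1], ys[:-1]))
    return context, xs[-1], ys[-1]  # context pairs, query x, target y
```

A Transformer trained on many such prompts must infer the function's parameters from the context pairs alone; that is the "basic skill" for one class.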
Composite Skill
ICL on a composite function class, i.e., functions built from the basic classes via a composition operation (e.g., h(x) = f(x) + g(x)).
Meta-skill
The high-level skill required for skill composition:
- Identifying if in-context samples come from a composite function
- Identifying the needed combination of basic ICL skills
- Applying a composite ICL skill on-the-fly
Function Composition Operations
- Addition: h(x) = f(x) + g(x)
- Maximum: h(x) = max(f(x), g(x))
- Multiplexing: h(x) selects f(x) or g(x) per input according to a gating condition
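The three composition operations above can be sketched as higher-order functions. This is an illustrative reconstruction, assuming addition means h(x) = f(x) + g(x), maximum means an elementwise max, and multiplexing means gating between f and g per input; the default gate here is an arbitrary assumption.

```python
import numpy as np

def compose(f, g, op, gate=None):
    """Combine two basic functions into a composite (names illustrative)."""
    if op == "add":        # h(x) = f(x) + g(x)
        return lambda x: f(x) + g(x)
    if op == "max":        # h(x) = max(f(x), g(x)), elementwise
        return lambda x: np.maximum(f(x), g(x))
    if op == "multiplex":  # h(x) = f(x) where gate(x) holds, else g(x)
        gate = gate or (lambda x: x >= 0)  # assumed default gate
        return lambda x: np.where(gate(x), f(x), g(x))
    raise ValueError(f"unknown composition op: {op}")
```

A meta-skill, in these terms, is inferring from context samples alone which `f`, `g`, and `op` generated the data, and then applying the resulting composite ICL skill on the fly.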
Main Results
- Partial Composition: training on a subset of the possible compositions enables ICL on held-out compositions of the same basic skills
- Cross Composition: training on compositions over one set of function classes enables generalization to compositions involving other classes
- Weak-to-strong: 2-function compositions → 3-5 function compositions
- Orthogonal Basis Requirement: orthogonal function bases (Fourier, Legendre polynomials) are crucial for generalization
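The orthogonal-basis point rests on a standard property: Legendre polynomials satisfy ∫₋₁¹ Pₘ(x)Pₙ(x) dx = 0 for m ≠ n, with the diagonal equal to 2/(2n + 1). A quick numerical check of this (a sketch unrelated to the paper's code; the helper name `legendre_gram` is an assumption):

```python
import numpy as np
from numpy.polynomial.legendre import Legendre

def legendre_gram(max_deg=4, n_pts=2001):
    """Gram matrix of Legendre polynomials P_0..P_max_deg on [-1, 1]
    via the trapezoidal rule; off-diagonal entries should vanish."""
    xs = np.linspace(-1.0, 1.0, n_pts)
    dx = xs[1] - xs[0]
    basis = [Legendre.basis(d)(xs) for d in range(max_deg + 1)]

    def inner(u, v):
        w = u * v  # trapezoidal rule: half-weight the endpoints
        return (w[0] / 2 + w[1:-1].sum() + w[-1] / 2) * dx

    return np.array([[inner(pm, pn) for pn in basis] for pm in basis])

G = legendre_gram()
```

Distinct basis functions contributing independent, non-interfering components is the intuition for why such bases would make composed functions easier to disentangle in context.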
Relevance to Project
Critical. Most directly aligned with our algebraic framework:
- Their function class composition = our skill composition
- Their meta-skill definition matches our notion of higher-order skills
- Provides experimental validation of compositional skill learning
- Orthogonal basis insight relevant for our primitive skill selection
Questions & Notes
- Can we extend their algebraic operations (add, max, multiplex) to linguistic skills?
- How do their generalization bounds relate to our emergence thresholds?
- Their setup is mathematical — how to translate to natural language domain?