Transformers Can Learn Meta-skills for Task Generalization in In-Context Learning

Citation

Authors: Yao Fan et al.
Year: 2024
Venue:
URL:

Abstract

Can Transformers learn “meta-skills” that enable composition of basic skills to generalize to unseen task combinations? Prior work showed Transformers trained on fail on compositions. This paper demonstrates conditions under which meta-skill learning succeeds.

Summary

The paper trains Transformers on in-context learning of function classes and their compositions, and shows that learned “meta-skills” let the models generalize to held-out compositions.

Key Contributions

  1. Definition of meta-skills as compositional operators
  2. Proof that training on partial compositions enables generalization
  3. Weak-to-strong generalization (2-compositions → 3-5 compositions)
  4. Importance of orthogonal basis functions for meta-skill learning

Core Concepts & Definitions

Basic Skill

The ability to perform in-context learning (ICL) on a function class (e.g., linear, quadratic, sine, sqrt, heaviside).
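
To make "ICL on a function class" concrete, here is a minimal sketch of how one in-context prompt could be built: draw one function from a class, evaluate it on a few inputs, and hold out a final query. The scalar inputs, the Gaussian weight, the input range, and the names `FUNCTION_CLASSES` and `sample_icl_prompt` are all assumptions made for illustration; the note does not record the paper's actual sampling scheme.

```python
import math
import random

# Hypothetical function classes. Each entry draws ONE random function from its class.
# The weight distribution (standard normal) is an assumption, not taken from the paper.
FUNCTION_CLASSES = {
    "linear": lambda rng: (lambda x, w=rng.gauss(0, 1): w * x),
    "quadratic": lambda rng: (lambda x, w=rng.gauss(0, 1): w * x * x),
    "sine": lambda rng: (lambda x, w=rng.gauss(0, 1): math.sin(w * x)),
}


def sample_icl_prompt(class_name, n_examples, seed=0):
    """Return (examples, query_x, target_y) for one function drawn from the class.

    examples is a list of (x, f(x)) pairs shown in context; the model would be
    asked to predict target_y = f(query_x).
    """
    rng = random.Random(seed)
    f = FUNCTION_CLASSES[class_name](rng)
    # Inputs are scalars in [-1, 1] here; this range is an illustrative choice.
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n_examples + 1)]
    examples = [(x, f(x)) for x in xs[:-1]]
    query_x = xs[-1]
    return examples, query_x, f(query_x)


if __name__ == "__main__":
    examples, query_x, target_y = sample_icl_prompt("linear", n_examples=5, seed=1)
    for x, y in examples:
        print(f"x={x:+.3f}  y={y:+.3f}")
    print(f"query x={query_x:+.3f}  target y={target_y:+.3f}")
```

A basic skill, in this reading, is the ability to predict the target accurately for any function drawn from a single class, given only the in-context pairs.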

Composite Skill

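ICL on a composite function class, i.e., a class built by combining basic function classes through a composition operation (see Function Composition Operations).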

Meta-skill

The high-level skill required for skill composition:

  1. Identifying whether the in-context samples come from a composite function
  2. Identifying the needed combination of basic ICL skills
  3. Applying a composite ICL skill on-the-fly

Function Composition Operations

  • Addition:
  • Maximum:
  • Multiplexing:
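
The formal definitions of these operations are missing from this note, so the sketch below only illustrates what each name most plausibly means for two functions `f` and `g`. Pointwise addition and pointwise maximum follow from the names; the multiplexing form (a selector predicate that routes each input to `f` or `g`) is an assumption and may differ from the paper's definition. All function names here are hypothetical.

```python
import math


def compose_add(f, g):
    """Pointwise sum: x -> f(x) + g(x)."""
    return lambda x: f(x) + g(x)


def compose_max(f, g):
    """Pointwise maximum: x -> max(f(x), g(x))."""
    return lambda x: max(f(x), g(x))


def compose_multiplex(f, g, selector):
    """ASSUMED form of multiplexing: use f where selector(x) is true, else g."""
    return lambda x: f(x) if selector(x) else g(x)


def linear(x):
    # Illustrative basic function with a fixed slope of 2.
    return 2.0 * x


sine = math.sin

if __name__ == "__main__":
    added = compose_add(linear, sine)
    larger = compose_max(linear, sine)
    muxed = compose_multiplex(linear, sine, selector=lambda x: x >= 0)
    for x in (-1.0, 0.5, 2.0):
        print(f"x={x:+.1f}  add={added(x):+.3f}  max={larger(x):+.3f}  mux={muxed(x):+.3f}")
```

Each operator returns a new function, so composite classes can be nested, for example `compose_add(compose_max(f, g), h)`.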

Main Results

  1. Partial Composition: Training on and enables ICL on
  2. Cross Composition: Training on enables generalization to
  3. Weak-to-strong: 2-function compositions → 3-5 function compositions
  4. Orthogonal Basis Requirement: orthogonal bases such as Fourier series and Legendre polynomials are crucial for generalization
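
The weak-to-strong setup in item 3 can be sketched as a train/test split over skill combinations: train on every 2-function composition and evaluate only on 3- to 5-function compositions. This is a sketch under assumptions: the five skill names are taken from the Basic Skill examples in this note, compositions are treated as unordered sets of distinct skills, and the helper names are hypothetical. The paper's actual split may differ.

```python
from itertools import combinations

# Skill names borrowed from the examples listed under "Basic Skill".
BASIC_SKILLS = ["linear", "quadratic", "sine", "sqrt", "heaviside"]


def compositions_of_size(skills, k):
    """All unordered k-element combinations of distinct skills."""
    return list(combinations(skills, k))


def weak_to_strong_split(skills, train_size=2, max_test_size=5):
    """Train on compositions of exactly train_size skills; test on larger ones."""
    train = compositions_of_size(skills, train_size)
    test = [
        combo
        for k in range(train_size + 1, max_test_size + 1)
        for combo in compositions_of_size(skills, k)
    ]
    return train, test


if __name__ == "__main__":
    train, test = weak_to_strong_split(BASIC_SKILLS)
    print(len(train), "training compositions, e.g.", train[0])
    print(len(test), "held-out compositions, e.g.", test[-1])
```

With five basic skills this yields 10 training pairs and 16 held-out compositions (10 triples, 5 quadruples, 1 quintuple), and no test composition ever appears in training.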

Relevance to Project

Critical. This is the paper most directly aligned with our algebraic framework:

  • Their function class composition = our skill composition
  • Meta-skill definition matches our higher-order
  • Provides experimental validation of compositional skill learning
  • Orthogonal basis insight relevant for our primitive skill selection
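
For primitive skill selection, one simple check is whether candidate basis functions are mutually orthogonal. The sketch below computes a Gram matrix of Legendre polynomials on [-1, 1] under the unweighted L2 inner product, using Bonnet's recurrence and a composite Simpson rule. The choice of interval, inner product, and numerical method are assumptions for illustration; the note does not say how the paper measures or enforces orthogonality.

```python
def legendre(n, x):
    """Evaluate the Legendre polynomial P_n(x) via Bonnet's recurrence:
    (k + 1) P_{k+1}(x) = (2k + 1) x P_k(x) - k P_{k-1}(x).
    """
    if n == 0:
        return 1.0
    p_prev, p = 1.0, x
    for k in range(1, n):
        p_prev, p = p, ((2 * k + 1) * x * p - k * p_prev) / (k + 1)
    return p


def inner_product(f, g, a=-1.0, b=1.0, n_intervals=2000):
    """Composite Simpson approximation of the integral of f(x) * g(x) on [a, b].

    n_intervals must be even.
    """
    h = (b - a) / n_intervals
    total = f(a) * g(a) + f(b) * g(b)
    for i in range(1, n_intervals):
        x = a + i * h
        total += (4 if i % 2 else 2) * f(x) * g(x)
    return total * h / 3


def gram_matrix(funcs):
    """Matrix of pairwise inner products; diagonal for an orthogonal family."""
    return [[inner_product(f, g) for g in funcs] for f in funcs]


def legendre_basis(count):
    """The first `count` Legendre polynomials as callables."""
    return [lambda x, n=n: legendre(n, x) for n in range(count)]


if __name__ == "__main__":
    for row in gram_matrix(legendre_basis(4)):
        print("  ".join(f"{value:+.4f}" for value in row))
```

Off-diagonal entries come out numerically zero and the diagonal approximates 2 / (2n + 1), the known squared norm of P_n. By contrast, the monomials x and x^3 have inner product 2/5 on the same interval, so a raw monomial basis would fail this check.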

Questions & Notes

  • Can we extend their algebraic operations (add, max, multiplex) to linguistic skills?
  • How do their generalization bounds relate to our emergence thresholds?
  • Their setup is mathematical; how would it translate to the natural-language domain?