SKILL-MIX: A Flexible and Expandable Family of Evaluations for AI Models
Citation
- Authors: (Multiple authors)
- Year: 2023
- Venue: Preprint (arXiv)
- URL: http://arxiv.org/abs/2310.17567
Abstract
This work focuses on the evaluation of AI models through a flexible framework that tests their ability to combine and compose skills. The capability to combine skills plays an important role in (human) pedagogy and also in understanding emergence phenomena.
Summary
SKILL-MIX provides a contamination-resistant evaluation framework that tests whether LLMs can flexibly combine multiple skills simultaneously, going beyond standard benchmarks.
Key Contributions
- A flexible evaluation framework testing skill composition ability
- “Beyond Stochastic Parrot” criterion for genuine understanding
- Evidence that models generate novel combinations not seen in training
- Auto-grading methodology using GPT-4/LLaMA-2-70B
Core Concepts & Definitions
Skills Set
101 language skills from Wikipedia entries (e.g., metaphor, modus ponens, self-serving bias), each with definition and example.
SKILL-MIX
Given $N$ skills and $T$ topics, sample a random subset of $k$ skills and one topic, then prompt the model to produce ~3 sentences demonstrating all $k$ skills in the context of the topic.
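The sampling step described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the function names, the prompt wording, and the two placeholder topics are assumptions made for the example. The helper `num_combinations` counts the distinct (skill subset, topic) pairs, which is what makes memorized answers unlikely as the number of combined skills grows.

```python
import math
import random


def sample_skill_mix_query(skills, topics, k, rng):
    """Pick k distinct skills and one topic uniformly at random."""
    chosen_skills = rng.sample(skills, k)
    topic = rng.choice(topics)
    return chosen_skills, topic


def build_prompt(chosen_skills, topic, max_sentences=3):
    """Assemble a query string (hypothetical wording, not the paper's prompt)."""
    skill_list = ", ".join(chosen_skills)
    return (
        f"Write at most {max_sentences} sentences about the topic '{topic}' "
        f"that together demonstrate all of these skills: {skill_list}."
    )


def num_combinations(n_skills, n_topics, k):
    """Number of distinct (skill subset, topic) pairs: C(n_skills, k) * n_topics."""
    return math.comb(n_skills, k) * n_topics


rng = random.Random(0)  # seeded only so the sketch is reproducible
skills = ["metaphor", "modus ponens", "self-serving bias"]  # examples from the notes
topics = ["gardening", "sewing"]  # placeholder topics, assumed for illustration
chosen, topic = sample_skill_mix_query(skills, topics, k=2, rng=rng)
prompt = build_prompt(chosen, topic)
```

With 101 skills, `num_combinations(101, 1, 2)` already gives 5050 skill pairs per topic, and the count grows quickly with the subset size.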
Beyond Stochastic Parrot Criterion
A model surpasses “stochastic parrot” behavior if an inequality relating the following quantities holds:
- Ratio of Full Marks on SKILL-MIX
- skill frequency in the corpus
- topic frequency
- training corpus size
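To make the first of these quantities concrete, here is a minimal sketch of how a Ratio of Full Marks could be computed from auto-grader output. It assumes that a generation earns full marks only when every graded criterion passes; the `Grade` record and its field names are hypothetical and do not reproduce the paper's rubric.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Grade:
    """One graded generation (hypothetical fields, assumed for illustration)."""
    skills_shown: List[bool]  # one flag per requested skill
    on_topic: bool
    within_length: bool

    def full_marks(self) -> bool:
        # Full marks only if every skill is shown and all other checks pass.
        return all(self.skills_shown) and self.on_topic and self.within_length


def ratio_of_full_marks(grades):
    """Fraction of graded generations that received full marks."""
    if not grades:
        raise ValueError("need at least one graded generation")
    return sum(g.full_marks() for g in grades) / len(grades)


grades = [
    Grade([True, True], on_topic=True, within_length=True),
    Grade([True, False], on_topic=True, within_length=True),   # one skill missing
    Grade([True, True], on_topic=False, within_length=True),   # off topic
    Grade([True, True], on_topic=True, within_length=True),
]
ratio = ratio_of_full_marks(grades)
```

An all-or-nothing score like this is strict by design: partial credit on three of four skills counts the same as a complete miss.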
Main Results
- GPT-4 achieves reasonable performance at $k = 5$
- For $k = 5$, GPT-4 generates combinations not seen in training
- Performance follows an approximate relationship between single-skill and composite ($k$-skill) success rates
- Most models saturate at small $k$; only GPT-4 performs well at larger $k$
- Evidence of “cramming for leaderboards”: models ranked highly on leaderboards underperform on SKILL-MIX
Relevance to Project
High — Directly relevant to our evaluation methodology:
- Provides operational definition of skill composition
- The $k$-tuple testing aligns with our complexity filtration
- “Beyond stochastic parrot” criterion relates to ontological-expansion
- Auto-grading approach useful for our assessment framework
Questions & Notes
- Can we adapt SKILL-MIX to test our algebraic skill compositions?
- Their skill list (101 Wikipedia skills) could seed our primitive skill set
- Filtering common skills (>5% frequency) makes evaluation harder; what are the implications for our fitness function?
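As a starting point for the last note, the frequency filter can be sketched as a simple threshold over a skill-to-frequency mapping. The function name is an assumption and the frequencies below are made-up placeholders, not measurements from any corpus; only the 5% threshold comes from the notes.

```python
def filter_common_skills(skill_freq, max_freq=0.05):
    """Keep only skills whose corpus frequency is at most max_freq."""
    return {skill: freq for skill, freq in skill_freq.items() if freq <= max_freq}


# Made-up frequencies, for illustration only.
skill_freq = {
    "metaphor": 0.12,
    "modus ponens": 0.01,
    "self-serving bias": 0.004,
}
rare_skills = filter_common_skills(skill_freq)
```

Here only "metaphor" is dropped, since its placeholder frequency exceeds the 5% threshold.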