SKILL-MIX: A Flexible and Expandable Family of Evaluations for AI Models

Citation

  • Authors: (Multiple authors)
  • Year: 2023
  • Venue: Preprint (arXiv)
  • URL: http://arxiv.org/abs/2310.17567

Abstract

This work focuses on the evaluation of AI models through a flexible framework that tests their ability to combine and compose skills. The capability to combine skills plays an important role in (human) pedagogy and also in understanding emergence phenomena.

Summary

SKILL-MIX provides a contamination-resistant evaluation framework that tests whether LLMs can flexibly combine multiple skills simultaneously, going beyond standard benchmarks.

Key Contributions

  1. A flexible evaluation framework testing skill composition ability
  2. “Beyond Stochastic Parrot” criterion for genuine understanding
  3. Evidence that models generate novel combinations not seen in training
  4. Auto-grading methodology using GPT-4/LLaMA-2-70B

Core Concepts & Definitions

Skills Set

101 language skills from Wikipedia entries (e.g., metaphor, modus ponens, self-serving bias), each with definition and example.

SKILL-MIX

Given N skills and T topics, sample a random subset of k skills and one topic, then prompt the model to produce a short text (~3 sentences) that demonstrates all k skills in the context of that topic.
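The sampling step can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the function name `build_skill_mix_prompt`, the prompt wording, the three-entry skill dictionary, and the topic list are all assumptions made for the example.

```python
import random


def build_skill_mix_prompt(skills, topics, k, rng):
    """Sample k distinct skills and one topic, then build an instruction string.

    `skills` maps a skill name to a short definition (hypothetical format).
    """
    chosen_skills = rng.sample(sorted(skills), k)  # k distinct skill names
    topic = rng.choice(topics)
    skill_lines = "\n".join(f"- {name}: {skills[name]}" for name in chosen_skills)
    prompt = (
        f"Write about 3 sentences on the topic '{topic}' that "
        f"demonstrate all of the following skills:\n{skill_lines}"
    )
    return chosen_skills, topic, prompt


# Hypothetical miniature skill set; the list described in these notes has 101 entries.
SKILLS = {
    "metaphor": "a figure of speech describing one thing as another",
    "modus ponens": "from 'if P then Q' and P, infer Q",
    "self-serving bias": "crediting successes to oneself and failures to circumstances",
}
TOPICS = ["gardening", "sewing", "dueling"]  # hypothetical topics

chosen, topic, prompt = build_skill_mix_prompt(
    SKILLS, TOPICS, k=2, rng=random.Random(0)
)
print(prompt)
```

Passing an explicit `random.Random` instance keeps the sampled combination reproducible, which may help when comparing several models on the same prompts.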

Beyond Stochastic Parrot Criterion

A model surpasses “stochastic parrot” behavior if a condition on the following quantities holds:

  • Ratio of Full Marks on SKILL-MIX
  • skill frequency in the corpus
  • topic frequency
  • training corpus size
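One quantity that plausibly matters for this kind of argument is how quickly the number of distinct (skill subset, topic) combinations grows with k, the number of skills combined. The sketch below only counts combinations; it is not the paper's criterion, which is not reproduced in these notes. The skill count of 101 comes from the notes above, while the topic count of 100 and the function name are assumptions for illustration.

```python
import math


def num_combinations(n_skills, n_topics, k):
    """Count distinct (k-skill subset, topic) pairs."""
    return math.comb(n_skills, k) * n_topics


N_SKILLS = 101  # skill count stated in these notes
N_TOPICS = 100  # hypothetical topic count, chosen only for illustration

for k in range(1, 6):
    print(f"k={k}: {num_combinations(N_SKILLS, N_TOPICS, k):,} combinations")
```

Even with these modest numbers, the count at k = 5 runs into the billions, which gives a feel for why a finite training corpus may not contain most combinations.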

Main Results

  1. GPT-4 achieves reasonable performance at k = 5
  2. For k = 5, GPT-4 generates combinations not seen in training
  3. Composite-skill performance follows an approximate relationship with single-skill performance
  4. Most models saturate at small k; only GPT-4 performs well at larger k
  5. Evidence of “cramming for leaderboards” — high-ranked models underperform on SKILL-MIX
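The Ratio of Full Marks metric named earlier can be illustrated as the fraction of graded outputs that pass every rubric item. The sketch below assumes a hypothetical rubric (all requested skills shown, on topic, within the length limit); the field names and the rubric itself are assumptions for the example, not the paper's exact grading scheme.

```python
from dataclasses import dataclass


@dataclass
class GradedOutput:
    """One grader verdict for one generation (hypothetical rubric fields)."""
    skills_shown: int       # requested skills the grader judged present
    skills_requested: int   # k for this prompt
    on_topic: bool
    within_length: bool


def is_full_marks(g):
    """True only if every rubric item passes."""
    return g.skills_shown == g.skills_requested and g.on_topic and g.within_length


def ratio_of_full_marks(graded):
    """Fraction of generations with full marks; 0.0 for an empty list."""
    if not graded:
        return 0.0
    return sum(is_full_marks(g) for g in graded) / len(graded)


# Hypothetical grader verdicts for four generations at k = 3.
graded = [
    GradedOutput(3, 3, True, True),
    GradedOutput(2, 3, True, True),    # missed one skill
    GradedOutput(3, 3, False, True),   # off topic
    GradedOutput(3, 3, True, True),
]
print(ratio_of_full_marks(graded))  # prints 0.5
```

An all-or-nothing score like this drops quickly as k grows, since a single missed skill costs the whole item.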

Relevance to Project

High — Directly relevant to our evaluation methodology:

  • Provides operational definition of skill composition
  • The k-tuple testing aligns with our complexity filtration
  • “Beyond stochastic parrot” criterion relates to ontological-expansion
  • Auto-grading approach useful for our assessment framework

Questions & Notes

  • Can we adapt SKILL-MIX to test our algebraic skill compositions?
  • Their skill list (101 Wikipedia skills) could seed our primitive skill set
  • Filtering out common skills (>5% frequency) makes evaluation harder; what are the implications for our fitness function?
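For the last note, the filtering step could look roughly like the sketch below. The frequency values are made up for illustration, and the function name and the choice to keep skills at exactly the threshold are assumptions; only the 5% cutoff comes from the note above.

```python
def filter_common_skills(skill_freq, max_freq=0.05):
    """Keep only skills whose corpus frequency does not exceed max_freq."""
    return {name: freq for name, freq in skill_freq.items() if freq <= max_freq}


# Hypothetical corpus frequencies; real values would need to be estimated.
skill_freq = {
    "metaphor": 0.12,
    "modus ponens": 0.01,
    "self-serving bias": 0.03,
}

rare_skills = filter_common_skills(skill_freq)
print(sorted(rare_skills))  # prints ['modus ponens', 'self-serving bias']
```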