SKILL-MIX: A Flexible and Expandable Family of Evaluations for AI Models

Citation

Authors: (Multiple authors)
Year: 2023
Venue: Preprint (arXiv)
URL: http://arxiv.org/abs/2310.17567

Abstract

This work focuses on the evaluation of AI models through a flexible framework that tests their ability to combine and compose skills. The capability to combine skills plays an important role in (human) pedagogy and also in understanding emergence phenomena.

Summary

SKILL-MIX provides a contamination-resistant evaluation framework that tests whether LLMs can flexibly combine multiple skills simultaneously, going beyond standard benchmarks.

Key Contributions

A flexible evaluation framework testing skill composition ability
“Beyond Stochastic Parrot” criterion for genuine understanding
Evidence that models generate novel combinations not seen in training
Auto-grading methodology using GPT-4/LLaMA-2-70B

Core Concepts & Definitions

Skills Set

101 language skills from Wikipedia entries (e.g., metaphor, modus ponens, self-serving bias), each with definition and example.

SKILL-MIX $(k)$

Given $N$ skills and $T$ topics, sample a random subset of $k$ skills and one topic, then prompt the model to produce ~3 sentences demonstrating all $k$ skills in the context of the topic.
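
The sampling step described above can be sketched in Python as follows. This is a minimal illustration, not the authors' code: the function names, the `max_sentences` parameter, and the prompt wording are all assumptions made for this example, and the paper's actual prompt is not reproduced here.

```python
import random
from math import comb


def sample_skill_mix_query(skills, topics, k, rng=None):
    """Draw k distinct skills and one topic uniformly at random.

    `skills` and `topics` are lists of strings. `rng` is an optional
    random.Random instance so that runs can be made reproducible.
    """
    rng = rng or random.Random()
    chosen_skills = rng.sample(skills, k)  # sampling without replacement
    topic = rng.choice(topics)
    return chosen_skills, topic


def build_prompt(chosen_skills, topic, max_sentences=3):
    """Assemble a hypothetical prompt; the wording is illustrative only."""
    skill_list = ", ".join(chosen_skills)
    return (
        f"Write about {max_sentences} sentences on the topic '{topic}' "
        f"that together illustrate all of these skills: {skill_list}."
    )


def num_combinations(n_skills, n_topics, k):
    """Number of distinct (k-subset of skills, topic) pairs."""
    return comb(n_skills, k) * n_topics
```

The count returned by `num_combinations` grows quickly with $k$, which is the reason a randomly drawn combination is unlikely to have appeared verbatim in training data.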

Beyond Stochastic Parrot Criterion

Model surpasses “stochastic parrot” behavior if $\alpha_{k} > \frac{3}{2} p_{s}^{k} \cdot p_{t} \cdot L$, where:

$\alpha_{k}$ = Ratio of Full Marks on SKILL-MIX $(k)$
$p_{s}$ = skill frequency in corpus
$p_{t}$ = topic frequency
$L$ = training corpus size
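
As a rough sketch, the inequality above can be checked numerically. The code implements the inequality exactly as written in these notes; the function names are hypothetical, the per-generation full-marks flags are assumed to come from the auto-grader, and the numbers in the usage note are made up for illustration rather than taken from the paper.

```python
def ratio_of_full_marks(full_marks):
    """Fraction of graded generations that received full marks.

    `full_marks` is a list of booleans, one per graded generation.
    """
    if not full_marks:
        raise ValueError("need at least one graded generation")
    return sum(full_marks) / len(full_marks)


def beyond_stochastic_parrot(alpha_k, p_s, p_t, corpus_size, k):
    """Check alpha_k > (3/2) * p_s**k * p_t * L as stated in the notes.

    alpha_k: Ratio of Full Marks on SKILL-MIX(k)
    p_s: skill frequency in the corpus
    p_t: topic frequency
    corpus_size: training corpus size L
    """
    threshold = 1.5 * (p_s ** k) * p_t * corpus_size
    return alpha_k > threshold
```

For example, with the illustrative values `p_s = 0.01`, `p_t = 0.01`, and `corpus_size = 1e9`, the right-hand side shrinks geometrically as `k` grows, so a fixed `alpha_k` that fails the check at `k = 1` can pass it at `k = 5`.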

Main Results

GPT-4 achieves reasonable performance at $k = 5$ with $\alpha_{5} \approx 0.12$
For $k \geq 5$, GPT-4 generates combinations not seen in training
Performance follows approximate $Y = X^{2}$ relationship (single vs composite)
Most models saturate by $k = 3$ or $k = 4$; only GPT-4 performs well at $k = 5$
Evidence of “cramming for leaderboards”: models that rank highly on leaderboards underperform on SKILL-MIX

Relevance to Project

High — Directly relevant to our evaluation methodology:

Provides operational definition of skill composition
The $k$-tuple testing aligns with our complexity filtration $S_{\leq k}$
“Beyond stochastic parrot” criterion relates to ontological-expansion
Auto-grading approach useful for our assessment framework

Questions & Notes

Can we adapt SKILL-MIX to test our algebraic skill compositions?
Their skill list (101 Wikipedia skills) could seed our primitive skill set $S_{0}$
Filtering common skills (>5% frequency) makes evaluation harder; what are the implications for our fitness function $\phi$?
