Evaluation

Definition

Skill evaluation is the assessment of whether and how well a model can apply skills to tasks.

In the Literature

SKILL-MIX

  • Task: generate text demonstrating k randomly chosen skills on a random topic
  • Metrics:
    • Skill Fraction: proportion of skills exhibited
    • Full Marks Ratio: proportion of generations achieving a perfect score
  • Auto-grading: GPT-4 / LLaMA-2-70B judges
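
The two metrics above can be sketched as simple aggregates over graded generations. This is a minimal illustration that follows the simplified definitions in the list, not the exact rubric of the SKILL-MIX paper. The record type `GradedGeneration` and its `other_criteria_ok` flag are hypothetical names introduced here; the flag stands in for any non-skill checks a grader might apply.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GradedGeneration:
    """One graded model output (hypothetical record format)."""
    skills_exhibited: List[bool]    # one flag per requested skill
    other_criteria_ok: bool = True  # assumption: all non-skill checks passed


def skill_fraction(gens: List[GradedGeneration]) -> float:
    """Mean proportion of requested skills exhibited per generation."""
    per_gen = [sum(g.skills_exhibited) / len(g.skills_exhibited) for g in gens]
    return sum(per_gen) / len(per_gen)


def full_marks_ratio(gens: List[GradedGeneration]) -> float:
    """Proportion of generations with every skill shown and all checks passed."""
    perfect = [all(g.skills_exhibited) and g.other_criteria_ok for g in gens]
    return sum(perfect) / len(perfect)


# Toy grades for k = 3 skills; the values are illustrative, not real results.
graded = [
    GradedGeneration([True, True, True]),
    GradedGeneration([True, False, True]),
    GradedGeneration([True, True, True], other_criteria_ok=False),
    GradedGeneration([False, False, True]),
]

print(skill_fraction(graded), full_marks_ratio(graded))
```

On this toy input the skill fraction is about 0.75 and the full marks ratio is 0.25, which shows how a model can exhibit most skills on average while rarely earning a perfect score.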

Beyond Stochastic Parrot Criterion

A model surpasses memorization if it succeeds on combinations of skills too numerous to have all appeared in its training data.

Competence (Arora & Goyal)

ACD (Lu et al.)

  • Automated task generation for capability discovery
  • Interestingness filtering for novel evaluations
  • Capability clustering and reporting
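
The three stages above can be sketched as a small loop. This is a toy illustration under loud assumptions and is not the method of Lu et al.: token-overlap (Jaccard) similarity stands in for interestingness filtering, a caller-supplied `categorize` function stands in for clustering, and `evaluate` is a stub for running the subject model. All function names and the `max_similarity` threshold are hypothetical.

```python
from collections import defaultdict


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity between two task descriptions."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 1.0


def is_interesting(task: str, archive: list, max_similarity: float = 0.5) -> bool:
    """Assumption: a task is novel if it is not too similar to any archived task."""
    return all(jaccard(task, seen) <= max_similarity for seen in archive)


def discover(proposals, evaluate, categorize):
    """Filter proposed tasks for novelty, evaluate them, and group the results."""
    archive = []
    report = defaultdict(list)
    for task in proposals:
        if not is_interesting(task, archive):
            continue  # drop near-duplicates of tasks already explored
        archive.append(task)
        report[categorize(task)].append((task, evaluate(task)))
    return dict(report)


# Toy inputs; in practice the proposals would come from a task-generating model.
proposals = [
    "sum two integers",
    "sum two large integers",          # near-duplicate, filtered out
    "translate a sentence to French",
    "reverse a string",
]
report = discover(
    proposals,
    evaluate=lambda task: "sum" in task,  # stub pass/fail outcome
    categorize=lambda task: "arithmetic" if "sum" in task else "language",
)
print(report)
```

Keeping the filter, evaluator, and grouping step as separate callables makes it easy to swap each stand-in for a model-based component.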

In This Project

Fitness Function

Measures how well a given skill solves a given task.

Evaluation Decomposition

The interaction term captures emergent effects beyond individual skills.
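
One way to make the interaction term concrete is to treat it as a residual. The sketch below assumes an additive form, in which the joint score of a skill set equals the sum of the individual skill scores plus an interaction term; this form is an assumption for illustration, and the function name and toy scores are hypothetical.

```python
def interaction_term(joint_score: float, individual_scores: list) -> float:
    """Residual of the joint score after subtracting individual contributions.

    Assumes: joint = sum(individual) + interaction.
    """
    return joint_score - sum(individual_scores)


# Toy scores for two skills applied separately and together on one task.
individual = {"algebra": 0.3, "geometry": 0.2}
joint = 0.7

interaction = interaction_term(joint, list(individual.values()))
print(interaction)
```

Under this assumption a positive residual (about 0.2 here) indicates that the skills help each other, a negative one indicates interference, and zero indicates purely additive contributions.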

Task-Relative Assessment

The set defines which skills are competent for a given task.
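
A task-relative competent set can be sketched as a threshold over fitness scores. The membership rule (fitness at or above a cutoff), the `threshold` default, and the fitness table layout are all assumptions made for illustration.

```python
def competent_skills(fitness: dict, task: str, threshold: float = 0.5) -> set:
    """Skills whose fitness on `task` meets the (assumed) competence threshold.

    `fitness` maps (skill, task) pairs to scores in [0, 1].
    """
    return {
        skill
        for (skill, t), score in fitness.items()
        if t == task and score >= threshold
    }


# Toy fitness table; the values are illustrative only.
fitness = {
    ("algebra", "solve_equation"): 0.9,
    ("geometry", "solve_equation"): 0.4,
    ("algebra", "draw_proof"): 0.3,
    ("geometry", "draw_proof"): 0.8,
}

print(competent_skills(fitness, "solve_equation"))
print(competent_skills(fitness, "draw_proof"))
```

The same skill can be competent for one task and not for another, which is the sense in which the assessment is task-relative.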

References