Fix “evaluation flaky or inconsistent”
Stub
This How-to is a stub. Evaluator authoring is still evolving; this page captures the intended debugging workflow.
Goal
You will reduce evaluator flakiness by controlling inputs, running multi-trial evaluations, and separating deterministic checks from rubric-based checks.
When to use this
- A capability passes sometimes and fails sometimes for the same test.
- Rubric evaluation scores vary widely.
Prerequisites
- A test case/dataset with fixed inputs
- An evaluator (deterministic or rubric-based)
- A pinned LLM profile (for rubric-based evaluation)
Steps
- Make inputs deterministic (pin versions, fix prompts, remove external variability).
- Prefer deterministic checks (exact match, schema or field validation) where possible; see the first sketch after this list.
- For rubric evaluation (see the multi-trial sketch after this list):
  - Run k trials and aggregate the results, recording the uncertainty (spread across trials) alongside the score.
  - Treat missing evidence as failure or inconclusive (avoid guessing).
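A minimal sketch of deterministic checks. The function names and the expected input shapes here are illustrative assumptions, not an existing API; the point is that the verdict is the same on every run for a fixed input.

```python
# Deterministic checks: same input, same verdict, every run.
import json


def check_exact(output: str, expected: str) -> bool:
    """Pass only when the normalized output matches the expected string."""
    return output.strip() == expected.strip()


def check_json_fields(output: str, required_fields: list[str]) -> bool:
    """Pass only when the output parses as JSON and contains every required field."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return all(field in parsed for field in required_fields)


# Example: both checks are stable for a fixed input.
print(check_exact("42\n", "42"))                     # True
print(check_json_fields('{"score": 3}', ["score"]))  # True
```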
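A sketch of the multi-trial rubric loop, assuming a `judge` callable that stands in for whatever rubric-based evaluator you use (an LLM grader, a scoring service) and returns a score or `None` when it cannot find evidence. The names and return shape are assumptions for illustration only.

```python
# Multi-trial rubric evaluation: run k trials, aggregate, record uncertainty,
# and count missing evidence against the run instead of guessing.
import statistics
from typing import Callable, Optional


def run_rubric_trials(
    judge: Callable[[str], Optional[float]],  # returns a score, or None if evidence is missing
    output: str,
    k: int = 5,
    pass_threshold: float = 0.7,
) -> dict:
    scores = []
    inconclusive = 0
    for _ in range(k):
        score = judge(output)
        if score is None:
            inconclusive += 1  # missing evidence is recorded, not imputed
        else:
            scores.append(score)

    mean = statistics.mean(scores) if scores else 0.0
    spread = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return {
        "trials": k,
        "scores": scores,
        "inconclusive": inconclusive,
        "mean": mean,
        "stdev": spread,
        "passed": mean >= pass_threshold and inconclusive == 0,
    }


# Example with a toy judge that always scores 0.8:
result = run_rubric_trials(lambda _: 0.8, "candidate output", k=3)
print(result["mean"], result["stdev"], result["passed"])
```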
Verify
- The evaluation result is stable enough to support promotion decisions.
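One way to make "stable enough" concrete is a small gate over the aggregated result from the multi-trial sketch above. The thresholds below (minimum mean, maximum spread, zero inconclusive trials) are illustrative assumptions; choose values that match your own promotion criteria.

```python
# Sketch of a stability gate before promotion; thresholds are placeholders.
def stable_enough_to_promote(result: dict,
                             min_mean: float = 0.7,
                             max_stdev: float = 0.1) -> bool:
    return (
        result["inconclusive"] == 0
        and result["mean"] >= min_mean
        and result["stdev"] <= max_stdev
    )


# Using the multi-trial result from the previous sketch:
# stable_enough_to_promote(result) -> True only when the score is both
# high enough and consistent across trials.
```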
Troubleshooting
- External dependencies → mock or record as artifacts.
- Prompt drift → pin prompts and schemas; store them as artifacts (a hash-pinning sketch follows this list).
- Too subjective → refine rubric criteria and scoring guidance.
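A sketch of pinning a prompt by content hash so drift is detected rather than silently absorbed. The artifact path and the placeholder hash are assumptions; record the real hash when you pin the prompt.

```python
# Detect prompt drift by comparing a stored artifact against its pinned hash.
import hashlib
from pathlib import Path

PROMPT_PATH = Path("artifacts/rubric_prompt.txt")       # assumed location
PINNED_SHA256 = "replace-with-the-hash-recorded-at-pin-time"


def prompt_hash(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()


def assert_prompt_unchanged() -> None:
    actual = prompt_hash(PROMPT_PATH)
    if actual != PINNED_SHA256:
        raise RuntimeError(
            f"Rubric prompt drifted: expected {PINNED_SHA256}, got {actual}"
        )
```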
Cleanup / Rollback
- None.
Next steps
- How-to: Run multi-trial stability evaluation
- How-to: Define promotion thresholds