The easiest mistake in model evaluation is treating one score as the whole answer. Scores are useful, but each one usually illuminates only a small part of the capability map.

Benchmark: Useful for quick comparison, but often far from real business tasks.
Human preference: Measures whether answers feel natural, clear, and helpful, but costs more.
Real tasks: Closest to product value, but requires clear data, constraints, and success criteria.

Leaderboard scores are a starting point

Benchmarks help us compare knowledge, math, code, and reasoning. But a high score does not guarantee that a model fits your actual workflow.

Real-task evaluation is closer to product judgment

For a data analysis agent, the question is not only whether the final answer is correct. The agent also needs to understand the business question, choose the right data, write runnable code, explain the results, and notice anomalies.
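
The checks listed above can be sketched as a simple pass/fail rubric. This is a minimal illustration in Python, not a prescribed scoring method: the class name `AgentRubric`, its field names, and the helpers `failed_checks` and `passes` are all hypothetical, and a real rubric may need graded scores rather than booleans.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class AgentRubric:
    """One pass/fail check per skill (hypothetical field names)."""
    answer_correct: bool
    understood_question: bool
    chose_right_data: bool
    code_ran: bool
    explained_results: bool
    noticed_anomalies: bool


def failed_checks(rubric: AgentRubric) -> list[str]:
    """Return the names of the checks that failed, in declaration order."""
    return [f.name for f in fields(rubric) if not getattr(rubric, f.name)]


def passes(rubric: AgentRubric) -> bool:
    """A run passes only if every check passed, not just answer_correct."""
    return not failed_checks(rubric)


# Example run: the final answer is right, but an anomaly in the data was missed.
example = AgentRubric(
    answer_correct=True,
    understood_question=True,
    chose_right_data=True,
    code_ran=True,
    explained_results=True,
    noticed_anomalies=False,
)
print(failed_checks(example))  # → ['noticed_anomalies']
print(passes(example))         # → False
```

Recording each check separately means a run with a correct answer can still be marked as a failure, and the record shows which skill was missing.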

Good evaluation must be reproducible

The test set, scoring rules, failure categories, and example records all need to be explicit. Otherwise a model may look better or worse simply because the test changed.
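
One way to make these pieces explicit is to store them with every evaluation report. The sketch below is an assumption about how that could look, not a standard format: the failure category names, the `scoring_rules_version` label, and the functions `fingerprint` and `build_report` are hypothetical. It hashes the test set so that two reports can be compared only when they were produced from the same cases.

```python
import hashlib
import json
from collections import Counter

# Hypothetical failure categories; a real project would define its own.
FAILURE_CATEGORIES = ("wrong_answer", "code_error", "missed_anomaly")


def fingerprint(test_cases: list[dict]) -> str:
    """Hash the test set. Key order inside a case does not change the
    hash; changing, adding, removing, or reordering cases does."""
    canonical = json.dumps(test_cases, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_report(test_cases: list[dict], results: list[dict],
                 scoring_rules_version: str) -> dict:
    """Bundle scores with everything needed to reproduce them.

    Each result is assumed to look like
    {"case_id": ..., "passed": bool, "failure": None or a category name}.
    """
    if not results:
        raise ValueError("no results to report")
    for r in results:
        if r["failure"] is not None and r["failure"] not in FAILURE_CATEGORIES:
            raise ValueError(f"unknown failure category: {r['failure']}")
    failures = Counter(r["failure"] for r in results if r["failure"] is not None)
    return {
        "test_set_sha256": fingerprint(test_cases),
        "scoring_rules_version": scoring_rules_version,
        "n_cases": len(test_cases),
        "pass_rate": sum(1 for r in results if r["passed"]) / len(results),
        "failures_by_category": dict(failures),
        "records": results,  # keep the per-case records for later inspection
    }
```

If two reports carry different `test_set_sha256` or `scoring_rules_version` values, a difference in `pass_rate` may reflect the changed test rather than the model.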

One-sentence takeaway

To judge a model, combine general scores, human experience, and real-task results. In product work, reliably solving the target task matters more than chasing one high score.