The easiest mistake in model evaluation is treating one score as the whole answer. Scores are useful, but each one usually illuminates only a small part of a model's capabilities.
Leaderboard scores are a starting point
Benchmarks help us compare models on knowledge, math, coding, and reasoning. But a high score does not guarantee that a model fits your actual workflow.
Real-task evaluation is closer to product judgment
For a data analysis agent, the question is not only whether the final answer is correct. The agent also needs to understand the business question, choose the right data, write runnable code, explain the results, and notice anomalies.
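One way to make this concrete is a per-task checklist that records each of those requirements separately instead of collapsing them into a single number. The sketch below is only an illustration: the `TaskRubric` class, its field names, and the helper functions are hypothetical, and equal weighting of the criteria is an assumption rather than a recommendation.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class TaskRubric:
    """A reviewer's pass/fail judgment on one task (hypothetical criteria)."""
    understood_question: bool
    chose_right_data: bool
    code_ran: bool
    answer_correct: bool
    explained_results: bool
    flagged_anomalies: bool


def failed_criteria(rubric: TaskRubric) -> List[str]:
    """Return the names of the criteria the agent did not meet."""
    return [f.name for f in fields(rubric) if not getattr(rubric, f.name)]


def pass_rate(rubric: TaskRubric) -> float:
    """Fraction of criteria met, with every criterion weighted equally (an assumption)."""
    values = [getattr(rubric, f.name) for f in fields(rubric)]
    return sum(values) / len(values)


# Example: the answer was right, but the agent neither explained it nor noticed an anomaly.
example = TaskRubric(
    understood_question=True,
    chose_right_data=True,
    code_ran=True,
    answer_correct=True,
    explained_results=False,
    flagged_anomalies=False,
)
print(failed_criteria(example))
print(round(pass_rate(example), 2))
```

Keeping the failed criteria as a list, rather than only the pass rate, shows where the agent falls short even when its final answer is correct.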
Good evaluation must be reproducible
The test set, scoring rules, failure categories, and example records all need to be explicit. Otherwise a model may look better or worse simply because the test changed.
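A simple guard against a silently changed test is to fingerprint the evaluation setup and store that fingerprint with every result. The following is a minimal sketch under assumed names: `spec_fingerprint`, `comparable`, and the dictionary layout are hypothetical, and it uses only Python's standard `hashlib` and `json` modules.

```python
import hashlib
import json


def spec_fingerprint(test_cases, scoring_rule, failure_categories):
    """Hash the evaluation setup so that any change to it is detectable.

    test_cases: list of JSON-serializable dicts (hypothetical layout)
    scoring_rule: a string naming or describing the rule
    failure_categories: list of category names
    """
    payload = json.dumps(
        {
            "test_cases": test_cases,
            "scoring_rule": scoring_rule,
            # Sorted so that reordering the categories does not change the hash.
            "failure_categories": sorted(failure_categories),
        },
        sort_keys=True,
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def comparable(run_a, run_b):
    """Two runs are comparable only if they used the same evaluation setup."""
    return run_a["spec_fingerprint"] == run_b["spec_fingerprint"]


cases_v1 = [{"id": "t1", "prompt": "Total revenue by region?", "expected": "table"}]
cases_v2 = cases_v1 + [{"id": "t2", "prompt": "Any outliers in Q3?", "expected": "list"}]
categories = ["wrong_data", "code_error", "missed_anomaly"]

run_a = {"model": "model-a", "spec_fingerprint": spec_fingerprint(cases_v1, "exact_match", categories)}
run_b = {"model": "model-b", "spec_fingerprint": spec_fingerprint(cases_v1, "exact_match", categories)}
run_c = {"model": "model-b", "spec_fingerprint": spec_fingerprint(cases_v2, "exact_match", categories)}

print(comparable(run_a, run_b))  # same setup
print(comparable(run_a, run_c))  # test set changed
```

Here the order of the test cases is part of the fingerprint while the order of the failure categories is not. That is a design choice for this sketch and should be adjusted to whatever actually affects scoring in a given setup.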
One-sentence takeaway
To judge a model, combine general benchmark scores, hands-on human experience, and real-task results. In product work, reliably solving the target task matters more than chasing one high score.