Quality estimation › FAQ › How is quality estimation different from quality evaluation, like BLEU?
Quality estimation and quality evaluation are not even in the same category.
Quality estimation models are fundamentally different from quality evaluation metrics or frameworks like BLEU, edit distance or MQM.
Those evaluation metrics and frameworks are for comparing machine translation engines or human translation services. They are rules that require pre-existing human reference translations or annotations. They are not meant to be accurate at the sentence level, only directionally correct on average. (BLEU does not even consider the source text.)
| | Quality estimation (AI to score AI translations) | Quality evaluation (rules to compare AI translation systems) |
|---|---|---|
| Type | AI task and model | Metric and rules |
| Goal | For verifying each segment, at the sentence level, inside a production AI system | For comparing AI translation systems (e.g. machine translation engines), on a sample dataset, offline |
| Input | Source, machine translation | Sources, machine translations, human reference translations |
| Output | Score per translation | Score per sample |
| Examples | ModelFront, OpenKiwi, QuEst++ | BLEU, edit distance, MQM, COMET |
*There are now also quality evaluation frameworks, like COMET, that use a type of quality estimation under the hood. But quality evaluation cannot be used for quality estimation, because in production there are no pre-existing human reference translations to compare against.
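To make the difference in inputs concrete, here is a minimal sketch, assuming the `sacrebleu` and `unbabel-comet` Python packages and the reference-free CometKiwi checkpoint (`Unbabel/wmt22-cometkiwi-da`, which may require a Hugging Face login to download). The example sentences are placeholders, not from this page.

```python
# Minimal sketch: quality evaluation needs human references,
# quality estimation scores each segment from source + MT alone.
import sacrebleu
from comet import download_model, load_from_checkpoint

sources = ["Das Zimmer ist zu klein."]
translations = ["The room is too small."]
references = ["The room is too small."]  # only needed for evaluation

# Quality evaluation (BLEU): compares translations against human references;
# the corpus-level score is what matters, not any single sentence.
bleu = sacrebleu.corpus_bleu(translations, [references])
print("BLEU:", bleu.score)

# Quality estimation (CometKiwi): scores each translation from the source
# and the machine translation, with no human reference required.
qe_model = load_from_checkpoint(download_model("Unbabel/wmt22-cometkiwi-da"))
qe_output = qe_model.predict(
    [{"src": s, "mt": t} for s, t in zip(sources, translations)],
    batch_size=8,
    gpus=0,  # run on CPU
)
print("QE score per segment:", qe_output.scores)
```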