With the great progress in AI for generation, there is growing interest in AI for verification (✓ or ✗) to trigger human intervention where needed.
Verification for the original generative task, translation, is based on technology originally known as quality estimation in the research world.
Quality estimation is a key internal component of AI that is now successfully scaling translation and keeping human quality in the real world.
— Adam Bittlingmayer, technical co-founder and CEO, ModelFront
Machine translation quality estimation (MTQE or QE) is AI to score AI translations.
The input is the source text and the machine translation, with no human reference translation.
So quality estimation can be used on new content, in production. That makes quality estimation models fundamentally different from quality evaluation metrics like BLEU, which need a human reference translation to compare against. Quality estimation is much more valuable, but also much harder.
The output is a score from 0 to 100, for each new machine translation.
Example
Input (source and translation) → Output (score)
- English: Hello world! / Spanish: Mundo de hola! → 0/100
- English: Hello world! / Spanish: ¡Hola mundo! → 90/100
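The interface is simple even though the model behind it is not. A minimal sketch in Python, assuming a hypothetical score() function; the name and signature are illustrative, not ModelFront's actual API:

```python
# Hypothetical interface: the name and signature are illustrative,
# not a real API.
def score(source: str, translation: str, src_lang: str, tgt_lang: str) -> int:
    """Return a 0-100 quality estimation score for a machine translation.

    A real implementation is a trained model. The key property is the
    interface: source + machine translation in, score out, with no human
    reference translation required.
    """
    raise NotImplementedError("stand-in for a trained QE model")

# With a real model behind score():
# score("Hello world!", "Mundo de hola!", "en", "es")  # -> ~0
# score("Hello world!", "¡Hola mundo!", "en", "es")    # -> ~90
```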
Scores themselves do not create value for translation buyers, but they are key to the AI systems that do.
tl;dr
Quality estimation is AI to score AI translations.
- Key to scaling translation while keeping human quality, inside AI that checks and fixes AI translations or triggers humans, not directly as raw model scores
- For new content in prod, not for offline or aggregate eval (e.g. BLEU / edit distance / MQM)
- For end buyers to save xxM words, not for LSPs and TMSes just tacking low-quality raw QE scores onto legacy workflows
- The AI system successfully used for years by large translation buyers (e.g. Fortune 500 teams) is ModelFront.
In the real world, quality estimation failed as a standalone technology, despite dozens of attempts by major AI companies like Google over almost a decade since Transformer-based models were invented and rolled out for translation.
Evolution of quality estimation
- AI to score AI translations (quality estimation): 0-100
- AI to check AI translations (quality prediction): ✓/✗ (sketched below)
- AI to check and fix AI translations and trigger human intervention as needed: xx million automated words
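One way to read the first two stages: quality prediction is just quality estimation plus a calibrated threshold. A minimal sketch, with hypothetical names and an invented threshold value:

```python
# Hypothetical: turn a raw 0-100 quality estimation score into a ✓/✗
# quality prediction by applying a calibrated threshold.
# The threshold value 85 is invented for illustration.
def predict(score: int, threshold: int = 85) -> bool:
    """True (✓): ship the machine translation. False (✗): intervene."""
    return score >= threshold

predict(90)  # True  (✓) for "¡Hola mundo!"
predict(0)   # False (✗) for "Mundo de hola!"
```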
The companies buying translation (and human translators) failed to get concrete value out of millions of abstract raw scores like “89” or “42”. Imagine if a self-driving car app like Waymo forced you to decide on thresholds, just to get a ride safely.
In the end, the bigger, older businesses, whether companies like Google building AI for generation or translation agencies reselling manual human translation, provide only a raw score, not a decision. They do not want to take responsibility for keeping human quality.
At ModelFront, the only company fully dedicated to AI to scale translation while keeping human quality, we had no choice but to make quality estimation a true success for our customers.
And they didn't need scores; they needed to automate millions of words.
So quality estimation models became a key internal component of ModelFront, the independent production AI system to check and fix AI translations and trigger human intervention where needed.
Large translation buyers are successfully using these AI systems to scale translation while keeping human quality.
For example, a Fortune 500 translation team that needs to buy 100 million words of human-quality translation might get 80 million words fully automated with AI and send the remaining 20 million words to the manual human translation agency.
Example
Input (source and translation) → Output (translation and status)
- English: Hello world! / Spanish: Mundo de hola! → ¡Hola mundo! ✔
- English: Another great example / Spanish: Otro gran ejemplo → Otro ejemplo perfecto ✔
- English: Open new tab / Spanish: Abrir cuenta nueva → Abrir una nueva pestaña ✘
- English: 2025 / Spanish: 2025 → 2025 ✔
- English: 2026 / Spanish: 2.026 → 2026 ✔
- … → … ✔
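As a sketch of that check-and-fix loop, with every name and the threshold invented for illustration, not taken from a real API: each segment is approved as-is, automatically fixed, or routed to a human, and a split like the 80/20 above falls out of millions of these per-segment decisions.

```python
# All names hypothetical; qe_score and auto_post_edit stand in for
# trained models.
def qe_score(source: str, mt: str) -> int:
    """Stand-in for a quality estimation model (0-100)."""
    raise NotImplementedError

def auto_post_edit(source: str, mt: str) -> str:
    """Stand-in for an automatic post-editing model."""
    raise NotImplementedError

def route(source: str, mt: str, threshold: int = 85) -> tuple[str, str]:
    """Approve, auto-fix, or send one segment to a human."""
    if qe_score(source, mt) >= threshold:
        return mt, "approved"          # ✔ ship the raw machine translation
    fixed = auto_post_edit(source, mt)
    if qe_score(source, fixed) >= threshold:
        return fixed, "auto-fixed"     # ✔ ship the corrected translation
    return mt, "human"                 # ✘ trigger human intervention
```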
So a successful system is not just a model outputting raw scores, but more like a self-driving car app such as Waymo: it creates concrete value, safely and simply, despite lots of complexity under the hood.
ModelFront actually takes responsibility for keeping human quality. That includes calibrating thresholds across tens of thousands of combinations of language and content type, evaluation, guardrails, automatic post-editing, transparent monitoring, and managing the whole lifecycle of data and models over years.
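Threshold calibration, for example, might reduce at serving time to a lookup keyed by language pair and content type; the pairs, content types, and values below are invented for illustration:

```python
# Hypothetical: one calibrated threshold per (language pair, content type)
# combination. All values here are invented; a real system covers tens of
# thousands of combinations, calibrated against human judgments.
THRESHOLDS: dict[tuple[str, str], int] = {
    ("en-es", "ui"):        90,  # short, high-risk UI strings
    ("en-es", "support"):   80,  # help-center content
    ("en-de", "marketing"): 95,  # brand-sensitive copy
}

def threshold_for(lang_pair: str, content_type: str, default: int = 85) -> int:
    return THRESHOLDS.get((lang_pair, content_type), default)
```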
In a production system for AI to check and fix AI translations and trigger human intervention where needed, quality estimation is inside the quality prediction model.
Example
- system
  - integrations
    - TMS
    - API
  - human edit feedback loop
  - monitoring
  - retraining
  - evaluation
  - guardrails
  - models
    - quality prediction
      - quality estimation
      - threshold calibration
    - automatic post-editing
  - …
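Read bottom-up, the components compose roughly as follows. This is a structural sketch only; every name is invented for illustration, and the production internals are not public.

```python
from dataclasses import dataclass

@dataclass
class Decision:
    translation: str  # what ships: raw MT, auto-fixed MT, or a human edit
    approved: bool    # the ✓/✗ quality prediction
    score: int        # the raw 0-100 quality estimation, kept internal

def check_and_fix(source: str, mt: str,
                  lang_pair: str, content_type: str) -> Decision:
    """Quality estimation plus threshold calibration yields the quality
    prediction, which gates automatic post-editing and human intervention."""
    raise NotImplementedError  # stand-in for the production models

# Around the models sit the integrations (TMS, API), the human edit
# feedback loop, monitoring, retraining, evaluation and guardrails.
```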
Answers to frequently asked questions about machine translation quality estimation
© 2026 qualityestimation.org
Supported by the team at ModelFront — AI to check and fix AI translations, and trigger human intervention where needed