With the great progress in AI for generation, there is growing interest in AI for verification (✓ or ✗) to trigger human intervention where needed.
Verification for translation, the original generative task, is based on technology originally known in the research world as quality estimation.
Quality estimation is a key internal component of AI that is now successfully scaling translation and keeping human quality in the real world.
— Adam Bittlingmayer, technical co-founder and CEO, ModelFront
Machine translation quality estimation (MTQE or QE) is AI to score AI translations.
The input is the source and the machine translation — no human translation. The output is a score from 0 to 100, for each new machine translation.
Example

| Input: source (English) | Input: translation (Spanish) | Output: score |
|---|---|---|
| Hello world! | Mundo de hola! | 0/100 |
| Hello world! | ¡Hola mundo! | 90/100 |
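For intuition, here is a minimal sketch of that scoring step using an open-source quality estimation model from the research world. It assumes the Unbabel COMET library and its reference-free CometKiwi checkpoint (a gated model on Hugging Face), and rescales the model's roughly 0 to 1 output to the 0 to 100 convention above; a production system involves much more, as described below.

```python
# A minimal sketch of reference-free quality estimation with the open-source
# COMET library and a CometKiwi checkpoint (assumed setup; the checkpoint is
# gated on Hugging Face and requires accepting its license).
from comet import download_model, load_from_checkpoint

model_path = download_model("Unbabel/wmt22-cometkiwi-da")
model = load_from_checkpoint(model_path)

# Input: only the source and the machine translation -- no human reference.
data = [
    {"src": "Hello world!", "mt": "Mundo de hola!"},
    {"src": "Hello world!", "mt": "¡Hola mundo!"},
]

output = model.predict(data, batch_size=8, gpus=0)

# The raw model scores are roughly 0 to 1; rescale to the 0-100 convention.
for row, score in zip(data, output.scores):
    print(f'{row["mt"]}: {round(score * 100)}/100')
```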
So quality estimation can be used for new content, in production. That makes quality estimation models fundamentally different from quality evaluation metrics like BLEU. Quality estimation is much more valuable, but also much harder.
tl;dr
- Quality estimation is AI to score AI translations
- Key to scaling translation while keeping human quality
- Used as part of an AI system to check and fix AI translations and trigger human intervention where needed, not directly as raw model scores
- For new content in production, not for offline or aggregate evaluation (cf. BLEU, edit distance or MQM)
- For end buyers, not for LSPs and TMSes just tacking low-quality raw QE scores onto legacy services and products
- The system successfully used for years by large translation buyers (e.g. Fortune 500 teams) is ModelFront.
In the real world, quality estimation failed as a standalone technology, despite dozens of attempts by epic AI companies like Google over almost a decade since Transformer-based models were invented and rolled out for translation.
The companies buying translation (and human translators) failed to get concrete value out of millions of abstract raw scores like “89” or “42”. Imagine if a self-driving car app like Waymo forced you to decide on thresholds, just to get a ride safely.
Ultimately, bigger, older businesses, from top AI companies like Google selling machine translation to translation agencies selling manual human translation, provide a raw score, not a decision, because they do not want to take responsibility for keeping human quality.
Rather, quality estimation models became a key internal component of ModelFront, the independent production AI system to check and fix AI translations and trigger human intervention where needed.
Quality estimation: AI to score AI (0-100) → Quality prediction: AI to check AI (✓/✗) → Trigger human intervention where needed → Scale translation while keeping human quality
Large translation buyers are successfully using these AI systems to scale translation while keeping human quality.
For example, a Fortune 500 translation team that needs to buy 100 million words of human-quality translation might get 80 million words fully automated with AI and send the remaining 20 million words to the manual human translation agency.
Example

| Input: source (English) | Input: translation (Spanish) | Output: translation | Output: status |
|---|---|---|---|
| … | … | … | … |
| Hello world! | Mundo de hola! | ¡Hola mundo! | ✓ |
| Another great example | Otro gran ejemplo | Otro ejemplo perfecto | ✓ |
| Open new tab | Abrir una cuenta nueva | Abrir una pestaña nueva | |
| 2025 | 2025 | 2025 | ✓ |
| 2026 | 2.026 | 2026 | ✓ |
| … | … | … | … |
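The decision logic on top of those scores is simple to picture: segments that clear a calibrated threshold are fully automated, and the rest go to human translators. A minimal sketch, where the threshold and the scores are illustrative assumptions, not calibrated values:

```python
# A minimal sketch of routing machine-translated segments by quality score.
# The threshold and scores are illustrative; real thresholds are calibrated
# per combination of language and content type.
THRESHOLD = 85

scored_segments = [
    ("¡Hola mundo!", 90),
    ("Abrir una cuenta nueva", 31),
    ("2026", 97),
]

# Segments at or above the threshold are fully automated; the rest are sent
# to human translators.
automated = [text for text, score in scored_segments if score >= THRESHOLD]
human_review = [text for text, score in scored_segments if score < THRESHOLD]

print(f"fully automated: {len(automated)} segments")
print(f"human review: {len(human_review)} segments")
```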
So a successful system is not just a raw scoring model, but more like a self-driving car app like Waymo. It creates concrete value, safely and simply, despite lots of complexity under the hood.
It takes responsibility for keeping human quality, from calibrating thresholds across tens of thousands of combinations of language and content type, to evaluation, guardrails, automatic post-editing, transparent monitoring and managing the whole lifecycle of data and models.
How the quality estimation component fits into a production system for AI to check and fix AI translations and trigger human intervention where needed
- system
  - integrations
    - TMS
    - API
    - human edit feedback loop
  - monitoring
  - retraining
  - evaluation
  - guardrails
  - models
    - quality prediction
    - quality estimation
  - threshold calibration
  - automatic post-editing
  - …
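As a rough illustration of how these components fit together for a single segment, here is a hypothetical sketch. The function names, stub implementations and threshold are illustrative assumptions, not ModelFront's actual architecture or API.

```python
from typing import Optional

def estimate_quality(source: str, translation: str) -> int:
    """Quality estimation model: score the translation from 0 to 100 (stubbed here)."""
    return 42  # placeholder; a real system calls a trained QE model

def threshold_for(lang: str, content_type: str) -> int:
    """Threshold calibration per combination of language and content type (stubbed here)."""
    return 90

def post_edit(source: str, translation: str) -> Optional[str]:
    """Automatic post-editing: return a fixed translation, or None if there is no safe fix."""
    return None

def check_and_fix(source: str, translation: str, lang: str, content_type: str) -> dict:
    """Check one machine-translated segment, fix it if possible, or route it to a human."""
    score = estimate_quality(source, translation)          # quality estimation (0-100)
    approved = score >= threshold_for(lang, content_type)  # quality prediction (✓/✗)
    if not approved:
        fixed = post_edit(source, translation)             # automatic post-editing
        if fixed is not None:
            translation, approved = fixed, True
    # Guardrails, monitoring, retraining and the human edit feedback loop
    # wrap around this core loop in a production system.
    return {"translation": translation, "status": "✓" if approved else "human review"}

print(check_and_fix("Hello world!", "Mundo de hola!", "es", "ui"))
```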
Answers to frequently asked questions about machine translation quality estimation
There are open-source quality estimation models and libraries from the research world that can be used as a base to try to build a homebrew system.
Production AI systems like ModelFront are fundamentally built to support more than 100 languages.
Production AI systems like ModelFront are available via integrations in the top translation management systems.
Many TMSes have also added some kind of automatic scoring or rating feature, but failed to make it work in the real world.
Quality estimation models are fundamentally different from quality evaluation metrics or frameworks like BLEU, edit distance or MQM.
Those eval metrics and frameworks are for comparing machine translation engines or manual human translation services. They require pre-existing human reference translations or annotations. They are not even meant to be accurate at the sentence level anyway, just directionally correct on average.
| | Quality estimation | Quality evaluation |
|---|---|---|
| Type | AI task and model | Metric |
| Goal | For verifying each segment, at the sentence level, inside a production AI system | For comparing AI translation systems (e.g. machine translation engines), on a sample dataset, offline |
| Input | Source, machine translation | Sources, machine translations, human reference translations |
| Output | Score per translation | Score per sample |
| Built with | AI | Rules* |
| Examples | Inside of ModelFront, OpenKiwi, QuEst++ | BLEU, edit distance, MQM, COMET |
*There are now also quality evaluation frameworks, like COMET, that use a type of quality estimation to do quality evaluation. But it’s just not possible to use quality evaluation to do quality estimation.
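To make the contrast concrete, here is a minimal sketch of reference-based quality evaluation with BLEU, using the open-source sacrebleu library. It needs pre-existing human reference translations and produces one aggregate score for the whole sample, which is exactly why it cannot check new translations in production.

```python
import sacrebleu

# Machine translations and the human reference translations they are scored against.
machine_translations = ["¡Hola mundo!", "Otro gran ejemplo"]
human_references = [["¡Hola, mundo!", "Otro gran ejemplo"]]  # one reference per segment

# One corpus-level score for the whole sample, e.g. to compare two MT engines offline.
bleu = sacrebleu.corpus_bleu(machine_translations, human_references)
print(round(bleu.score, 1))
```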
| | Quality estimation: AI to score AI translations | Quality prediction: AI to check AI translations | ModelFront: AI to check and fix AI translations or trigger human intervention |
|---|---|---|---|
| What it predicts | Overall usefulness of a translation (good enough or not) | Detailed outcomes (edit time, post-editing cost, user behavior, business KPIs) | Whether a translation needs fixing or human review, then automatically fixes or routes |
| Output | Abstract quality score or ✓/✗ | Task- or business-specific signal (seconds saved, cost, clicks, conversions) | Fixed translation or routing decision with quality guarantees |
| Scope | Focused on translation quality itself | Connects translation quality to downstream impact | End-to-end system ensuring human quality with automatic fixes and guardrails |
| Typical use | Routing, triage, human-in-the-loop decisions | Optimization of workflows, budgets and product experiences | Production systems for large translation buyers scaling translation while keeping human quality |
Machine translation quality estimation first emerged as a research task in the early 2010s, before the era of deep learning, with the support of Professor Lucia Specia and industry researchers like Radu Soricut at Google. Professor Specia and her team released multiple open-source machine learning libraries and frameworks, and a book on the topic.
Unbabel, a startup backed by Y Combinator, the top accelerator, built a new type of translation company around quality estimation models and manual human translation, and it open-sourced libraries and shared research.
ModelFront launched the first quality estimation API and created the category in the 2020s, based on LLMs that supported more than 100 languages.
ModelFront soon made AI to check and fix AI work in the real world, for Fortune 500 translation buyers, by taking responsibility for keeping human quality, right inside legacy systems.
While translation was the first generative task, the basic idea is neither new nor unique to translation. Automatically triggering human intervention is key to autopilots and now self-driving cars. And among other language generation tasks, there are now startups like Momentic doing AI verification for coding.
The value created by the systems that include quality estimation should go, and is going, primarily to end buyers.
Quality estimation failed as a copycat AI feature tacked onto manual human translation services, because those providers have a direct conflict of interest, given their existing business model.
It would be like taxi companies inventing self-driving cars. They typically lack both the motivation and the AI research and engineering DNA to crack this hard problem.
In fact, even companies that started out as AI companies stopped disrupting and started pushing lock-in when they started selling yet another manual human translation service.
(Typically under the influence or pressure of their venture capital or private equity investors pushing “vertical integration” without deeply understanding the vertical.)
This echoes the adoption pattern of the most significant earlier translation automation technology, the translation memory.
In the long run, agencies benefit, because increasing efficiency increases consumption (Jevons paradox), albeit indirectly and in aggregate.
© 2025 qualityestimation.org
Supported by the team at ModelFront — AI to check and fix AI translations, and trigger human intervention where needed