With the great progress in AI for generation, there is growing interest in AI for verification (✓ or ✗) to trigger human intervention where needed.
Verification for translation, the original generative task, is based on technology originally known in the research world as quality estimation.
Quality estimation is a key internal component of AI that is now successfully scaling translation and keeping human quality in the real world.
— Adam Bittlingmayer, technical co-founder and CEO, ModelFront
Machine translation quality estimation (MTQE or QE) is AI to score AI translations.
The input is the source and the machine translation — no human translation. The output is a score from 0 to 100, for each new machine translation.
Example

| Input: source (English) | Input: translation (Spanish) | Output: score |
|---|---|---|
| Hello world! | Mundo de hola! | 0/100 |
| Hello world! | ¡Hola mundo! | 90/100 |
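For intuition, here is a minimal sketch of that scoring step using an open-source quality estimation model from the research world. It assumes the Unbabel COMET library and its reference-free CometKiwi checkpoint (a gated model on Hugging Face), and rescales the model's roughly 0 to 1 output to the 0 to 100 convention above; a production system involves much more, as described below.

```python
# A minimal sketch of reference-free quality estimation with the open-source
# COMET library and a CometKiwi checkpoint (assumed setup; the checkpoint is
# gated on Hugging Face and requires accepting its license).
from comet import download_model, load_from_checkpoint

model_path = download_model("Unbabel/wmt22-cometkiwi-da")
model = load_from_checkpoint(model_path)

# Input: only the source and the machine translation -- no human reference.
data = [
    {"src": "Hello world!", "mt": "Mundo de hola!"},
    {"src": "Hello world!", "mt": "¡Hola mundo!"},
]

output = model.predict(data, batch_size=8, gpus=0)

# The raw model scores are roughly 0 to 1; rescale to the 0-100 convention.
for row, score in zip(data, output.scores):
    print(f'{row["mt"]}: {round(score * 100)}/100')
```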
So quality estimation can be used for new content, in production. That makes quality estimation models fundamentally different from quality evaluation metrics like BLEU. Quality estimation is much more valuable, but also much harder.
tl;dr
- Quality estimation is AI to score AI translations
- Key to scaling translation while keeping human quality
- Used as part of an AI system to check and fix AI translations and trigger human intervention where needed, not directly as raw model scores
- For new content in production, not for offline or aggregate evaluation (cf. BLEU, edit distance or MQM)
- For end buyers, not for LSPs and TMSes just tacking low-quality raw QE scores onto legacy services and products
- The system successfully used for years by large translation buyers (e.g. Fortune 500 teams) is ModelFront.
In the real world, quality estimation failed as a standalone technology, despite dozens of attempts by epic AI companies like Google over almost a decade since Transformer-based models were invented and rolled out for translation.
The companies buying translation (and human translators) failed to get concrete value out of millions of abstract raw scores like “89” or “42”. Imagine if a self-driving car app like Waymo forced you to decide on thresholds, just to get a ride safely.
Ultimately, bigger, older businesses, from top AI companies like Google selling machine translation to translation agencies selling manual human translation, provide a raw score, not a decision, because they do not want to take responsibility for keeping human quality.
Rather, quality estimation models became a key internal component of ModelFront, the independent production AI system to check and fix AI translations and trigger human intervention where needed.
Quality estimation: AI to score AI (0-100) → Quality prediction: AI to check AI (✓/✗) → Trigger human intervention where needed → Scale translation while keeping human quality
Large translation buyers are successfully using these AI systems to scale translation while keeping human quality.
For example, a Fortune 500 translation team that needs to buy 100 million words of human-quality translation might get 80 million words fully automated with AI and send the remaining 20 million words to the manual human translation agency.
Example

| Input: source (English) | Input: translation (Spanish) | Output: translation | Output: status |
|---|---|---|---|
| … | … | … | … |
| Hello world! | Mundo de hola! | ¡Hola mundo! | ✓ |
| Another great example | Otro gran ejemplo | Otro ejemplo perfecto | ✓ |
| Open new tab | Abrir una cuenta nueva | Abrir una pestaña nueva | |
| 2025 | 2025 | 2025 | ✓ |
| 2026 | 2.026 | 2026 | ✓ |
| … | … | … | … |
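The decision logic on top of those scores is simple to picture: segments that clear a calibrated threshold are fully automated, and the rest go to human translators. A minimal sketch, where the threshold and the scores are illustrative assumptions, not calibrated values:

```python
# A minimal sketch of routing machine-translated segments by quality score.
# The threshold and scores are illustrative; real thresholds are calibrated
# per combination of language and content type.
THRESHOLD = 85

scored_segments = [
    ("¡Hola mundo!", 90),
    ("Abrir una cuenta nueva", 31),
    ("2026", 97),
]

# Segments at or above the threshold are fully automated; the rest are sent
# to human translators.
automated = [text for text, score in scored_segments if score >= THRESHOLD]
human_review = [text for text, score in scored_segments if score < THRESHOLD]

print(f"fully automated: {len(automated)} segments")
print(f"human review: {len(human_review)} segments")
```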
So a successful system is not just a raw scoring model, but more like a self-driving car app like Waymo. It creates concrete value, safely and simply, despite lots of complexity under the hood.
It takes responsibility for keeping human quality, from calibrating thresholds across tens of thousands of combinations of language and content type, to evaluation, guardrails, automatic post-editing, transparent monitoring and managing the whole lifecycle of data and models.
How the quality estimation component fits into a production system for AI to check and fix AI translations and trigger human intervention where needed
- system
  - integrations
    - TMS
    - API
    - human edit feedback loop
  - monitoring
  - retraining
  - evaluation
  - guardrails
  - models
    - quality prediction
    - quality estimation
  - threshold calibration
  - automatic post-editing
  - …
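As a rough illustration of how these components fit together for a single segment, here is a hypothetical sketch. The function names, stub implementations and threshold are illustrative assumptions, not ModelFront's actual architecture or API.

```python
from typing import Optional

def estimate_quality(source: str, translation: str) -> int:
    """Quality estimation model: score the translation from 0 to 100 (stubbed here)."""
    return 42  # placeholder; a real system calls a trained QE model

def threshold_for(lang: str, content_type: str) -> int:
    """Threshold calibration per combination of language and content type (stubbed here)."""
    return 90

def post_edit(source: str, translation: str) -> Optional[str]:
    """Automatic post-editing: return a fixed translation, or None if there is no safe fix."""
    return None

def check_and_fix(source: str, translation: str, lang: str, content_type: str) -> dict:
    """Check one machine-translated segment, fix it if possible, or route it to a human."""
    score = estimate_quality(source, translation)          # quality estimation (0-100)
    approved = score >= threshold_for(lang, content_type)  # quality prediction (✓/✗)
    if not approved:
        fixed = post_edit(source, translation)             # automatic post-editing
        if fixed is not None:
            translation, approved = fixed, True
    # Guardrails, monitoring, retraining and the human edit feedback loop
    # wrap around this core loop in a production system.
    return {"translation": translation, "status": "✓" if approved else "human review"}

print(check_and_fix("Hello world!", "Mundo de hola!", "es", "ui"))
```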
Answers to frequently asked questions about machine translation quality estimation
There are open-source quality estimation models and libraries from the research world that can be used as a base to try to build a homebrew system.
Production AI systems like ModelFront are fundamentally built to support more than 100 languages.
Production AI systems like ModelFront are available via integrations in the top translation management systems.
Many TMSes have also added some kind of automatic scoring or rating feature, but failed to make it work in the real world.
Quality estimation models are fundamentally different from quality evaluation metrics or frameworks like BLEU, edit distance or MQM.
Those eval metrics and frameworks are for comparing machine translation engines or manual human translation services. They require pre-existing human reference translations or annotations. They are not even meant to be accurate at the sentence level anyway, just directionally correct on average.
| | Quality estimation | Quality evaluation |
|---|---|---|
| Type | AI task and model | Metric |
| Goal | For verifying each segment, at the sentence level, inside a production AI system | For comparing AI translation systems (e.g. machine translation engines), on a sample dataset, offline |
| Input | Source, machine translation | Sources, machine translations, human reference translations |
| Output | Score per translation | Score per sample |
| Built with | AI | Rules* |
| Examples | Inside of ModelFront, OpenKiwi, QuEst++ | BLEU, edit distance, MQM, COMET |
*There are now also quality evaluation frameworks, like COMET, that use a type of quality estimation to do quality evaluation. But it’s just not possible to use quality evaluation to do quality estimation.
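To make the contrast concrete, here is a minimal sketch of reference-based quality evaluation with BLEU, using the open-source sacrebleu library. It needs pre-existing human reference translations and produces one aggregate score for the whole sample, which is exactly why it cannot check new translations in production.

```python
import sacrebleu

# Machine translations and the human reference translations they are scored against.
machine_translations = ["¡Hola mundo!", "Otro gran ejemplo"]
human_references = [["¡Hola, mundo!", "Otro gran ejemplo"]]  # one reference per segment

# One corpus-level score for the whole sample, e.g. to compare two MT engines offline.
bleu = sacrebleu.corpus_bleu(machine_translations, human_references)
print(round(bleu.score, 1))
```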
| | Quality estimation: AI to score AI translations | Quality prediction: AI to check AI translations | ModelFront: AI to check and fix AI translations or trigger human intervention |
|---|---|---|---|
| What it predicts | Overall usefulness of a translation (good enough or not) | Detailed outcomes (edit time, post-editing cost, user behavior, business KPIs) | Whether a translation needs fixing or human review, then automatically fixes or routes |
| Output | Abstract quality score or ✓/✗ | Task- or business-specific signal (seconds saved, cost, clicks, conversions) | Fixed translation or routing decision with quality guarantees |
| Scope | Focused on translation quality itself | Connects translation quality to downstream impact | End-to-end system ensuring human quality with automatic fixes and guardrails |
| Typical use | Routing, triage, human-in-the-loop decisions | Optimization of workflows, budgets and product experiences | Production systems for large translation buyers scaling translation while keeping human quality |
Machine translation quality estimation first emerged as a research task in the early 2010s, before the era of deep learning, with the support of Professor Lucia Specia and industry researchers like Radu Soricut at Google. Professor Specia and her team released multiple open-source machine learning libraries and frameworks, and a book on the topic.
Unbabel, a startup backed by Y Combinator, the top accelerator, built a new type of translation company around quality estimation models and manual human translation, and it open-sourced libraries and shared research.
ModelFront launched the first quality estimation API and created the category in the 2020s, based on LLMs that supported more than 100 languages.
ModelFront soon made AI to check and fix AI work in the real world, for Fortune 500 translation buyers, by taking responsibility for keeping human quality, right inside legacy systems.
While translation was the first generative task, the basic idea is neither new nor unique to translation. Automatically triggering human intervention is key to autopilots and now self-driving cars. And among other language generation tasks, there are now startups like Momentic doing AI verification for coding.
The value created by the systems that include quality estimation should go, and is going, primarily to end buyers.
Quality estimation failed as a copycat AI feature tacked onto manual human translation services, because those providers have a direct conflict of interest, given their existing business model.
It would be like taxi companies inventing self-driving cars. They typically lack both the motivation and the AI research and engineering DNA to crack this hard problem.
In fact, even companies that started out as AI companies stopped disrupting and started pushing lock-in when they started selling yet another manual human translation service.
(Typically under the influence or pressure of their venture capital or private equity investors pushing “vertical integration” without deeply understanding the vertical.)
This echoes the adoption pattern of the most significant earlier translation automation technology, the translation memory.
In the long run, agencies benefit, because increasing efficiency increases consumption (Jevons paradox), albeit indirectly and in aggregate.
© 2025 qualityestimation.org
Supported by the team at ModelFront — AI to check and fix AI translations, and trigger human intervention where needed