Machine Translation

Bitext Refinement

Mined bitexts can contain imperfect translations that yield unreliable training signals for neural machine translation. While filtering such pairs out is known to improve final model quality, filtering is suboptimal in low-resource conditions, where even mined data is scarce. Can we do better? How can we improve machine translation by refining its training data rather than discarding it?
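To make the trade-off concrete, one common filtering baseline scores each mined pair with a cross-lingual sentence encoder and keeps only pairs above a similarity threshold; whatever falls below is lost, which is exactly what hurts in low-resource settings. The sketch below is a minimal illustration of such threshold filtering, not a method from this work: it assumes the sentence-transformers library and the LaBSE encoder, and the 0.8 threshold and example pair are purely illustrative.

```python
# A minimal sketch of similarity-based bitext filtering (assumed setup:
# the sentence-transformers library with the LaBSE multilingual encoder).
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/LaBSE")

def filter_bitext(sources, targets, threshold=0.8):
    """Keep pairs whose cross-lingual cosine similarity is at least threshold."""
    src_emb = model.encode(sources, normalize_embeddings=True)
    tgt_emb = model.encode(targets, normalize_embeddings=True)
    # For L2-normalized vectors, cosine similarity is just the dot product.
    scores = np.sum(src_emb * tgt_emb, axis=1)
    return [(s, t, float(sc))
            for s, t, sc in zip(sources, targets, scores)
            if sc >= threshold]

# Illustrative threshold: in practice it would be tuned per language pair.
kept = filter_bitext(
    ["The committee approved the proposal."],
    ["Le comité a approuvé la proposition."],
)
```

Refinement, by contrast, would edit or re-translate the low-scoring pairs rather than drop them, keeping their training signal available.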

Divergences in Machine Translation

Parallel texts, a source paired with its (human) translation, are routinely used to train machine translation systems under the assumption that the two sides are equivalent in meaning. Yet parallel texts can contain semantic divergences: cases where the source and its translation do not convey exactly the same content. How do these divergences interact with neural machine translation training and evaluation? How can we calibrate our equivalence assumptions to model parallel texts better?
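One way to relax the equivalence assumption, sketched below as a hypothetical illustration rather than a method from this work, is to turn divergence scores into per-example training weights, so divergent pairs contribute less to the loss instead of being treated as fully equivalent or thrown away. It assumes similarity scores in [0, 1], e.g., from a LaBSE-style scorer like the one above; the mapping and floor value are illustrative.

```python
# A hedged sketch: map per-pair similarity scores to loss weights, so that
# divergent pairs are down-weighted during NMT training rather than removed.
import numpy as np

def divergence_weights(scores, floor=0.1):
    """Map similarity scores in [0, 1] to loss weights in [floor, 1]."""
    scores = np.clip(np.asarray(scores, dtype=float), 0.0, 1.0)
    return floor + (1.0 - floor) * scores

# A weighted NMT objective would multiply each sentence pair's loss by its
# weight; near-equivalent pairs keep almost full weight, divergent ones less.
weights = divergence_weights([0.95, 0.42, 0.10])  # -> [0.955, 0.478, 0.19]
```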