
Mined bitexts can contain imperfect translations that yield unreliable training signals for Neural Machine Translation (NMT). While filtering such pairs out is known to improve final model quality, we argue that it is suboptimal in low-resource conditions where even mined data can be limited. In our work, we propose instead to refine the mined bitexts via automatic editing: given a sentence in a language xf and a possibly imperfect translation of it xe, our model generates a revised version xf' or xe' that yields a more equivalent translation pair (i.e., (xf, xe') or (xf', xe)). We use a simple editing strategy by (1) mining potentially imperfect translations for each sentence in a given bitext, and (2) learning a model to reconstruct the original translations and translate, in a multi-task fashion. Experiments demonstrate that our approach successfully improves the quality of CCMatrix mined bitext for 5 low-resource language pairs and 10 translation directions by up to 8 BLEU points, in most cases improving upon a competitive back-translation baseline.

Figure 1: Noisy bitexts consist of a mixture of good-quality, imperfect, and poor-quality translations. Filtering decreases the number of training samples, which is crucial for low-resource NMT. Our approach, alternatively, revises noisy bitexts by utilizing imperfect translations more effectively, while keeping the size of the training data untouched.

Neural Machine Translation (NMT) for low-resource languages is challenging due to the scarcity of bitexts (i.e., translated text in two languages) koehn-knowles-2017-six. Models are often trained on heuristically aligned resnik-1999-mining banon-etal-2020-paracrawl espla-etal-2019-paracrawl or automatically mined data schwenk-etal-2021-wikimatrix schwenk-etal-2021-ccmatrix, which can be low quality briakou-carpuat-2020-detecting mashakane. This data can include errors that range from small meaning differences or partial mistranslations to major differences that yield completely incorrect translations and random noise (e.g., empty sequences, text in the wrong language, etc.).

In this paper, we instead aim to make use of as much of the signal from the mined bitext as possible. We propose an editing approach to bitext quality improvement. Our model takes as input a bitext (i.e., (xf, xe)) and edits one of the two sentences to generate a refined version of the original (i.e., xf' or xe') as necessary. By framing the problem as a bitext editing (BitextEdit) task, we can perform a wide range of operations, from copying good-quality bitext, to partially editing small meaning mismatches, to translating incorrect references from scratch. Following previous extrinsic evaluations of bitext quality koehn-etal-2019-findings koehn-etal-2020-findings schwenk-etal-2021-ccmatrix schwenk-etal-2021-wikimatrix, we compare NMT models trained on the original and revised versions of CCMatrix bitexts.
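The refinement workflow described above can be sketched as a three-way decision per bitext pair: copy good-quality pairs, partially edit pairs with small meaning mismatches, and retranslate incorrect references from scratch. This is a minimal illustrative sketch, not the paper's actual system: `quality_score`, `edit_translation`, `translate`, and both thresholds are hypothetical stand-ins for the learned components.

```python
def refine_bitext(x_f, x_e, quality_score, edit_translation, translate,
                  keep_threshold=0.9, edit_threshold=0.5):
    """Return a (possibly revised) translation pair (x_f, x_e')."""
    score = quality_score(x_f, x_e)
    if score >= keep_threshold:
        # Good-quality bitext: copy the pair through unchanged.
        return x_f, x_e
    if score >= edit_threshold:
        # Small meaning mismatch: partially edit the imperfect side.
        return x_f, edit_translation(x_f, x_e)
    # Incorrect reference: translate from scratch.
    return x_f, translate(x_f)
```

In the actual model the same sequence-to-sequence editor covers all three cases implicitly (copying reduces to an identity edit, retranslation to a full rewrite); the explicit branching here only makes the range of operations visible.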
