TY - GEN
T1 - Do explicit alignments robustly improve multilingual encoders?
AU - Wu, Shijie
AU - Dredze, Mark
N1 - Funding Information:
This research is supported in part by ODNI, IARPA, via the BETTER Program contract #2019-19051600005. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein.
Publisher Copyright:
© 2020 Association for Computational Linguistics
PY - 2020
Y1 - 2020
AB - Multilingual BERT (Devlin et al., 2019, mBERT), XLM-RoBERTa (Conneau et al., 2019, XLMR) and other unsupervised multilingual encoders can effectively learn cross-lingual representations. Explicit alignment objectives based on bitexts like Europarl or MultiUN have been shown to further improve these representations. However, word-level alignments are often suboptimal and such bitexts are unavailable for many languages. In this paper, we propose a new contrastive alignment objective that can better utilize such signal, and examine whether these previous alignment methods can be adapted to noisier sources of aligned data: a randomly sampled 1 million pair subset of the OPUS collection. Additionally, rather than reporting results on a single dataset with a single model run, we report the mean and standard deviation of multiple runs with different seeds, on four datasets and tasks. Our more extensive analysis finds that, while our new objective outperforms previous work, overall these methods do not improve performance under a more robust evaluation framework. Furthermore, the gains from using a better underlying model eclipse any benefits from alignment training. These negative results dictate more care in evaluating these methods and suggest limitations in applying explicit alignment objectives.
UR - http://www.scopus.com/inward/record.url?scp=85108393754&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85108393754&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85108393754
T3 - EMNLP 2020 - 2020 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference
SP - 4471
EP - 4482
BT - EMNLP 2020 - 2020 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference
PB - Association for Computational Linguistics (ACL)
T2 - 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020
Y2 - 16 November 2020 through 20 November 2020
ER -