Bernice: A Multilingual Pre-trained Encoder for Twitter

Alexandra DeLucia, Shijie Wu, Aaron Mueller, Carlos Aguirre, Mark Dredze, Philip Resnik

Research output: Contribution to conferencePaperpeer-review

Abstract

The language of Twitter differs significantly from that of other domains commonly included in large language model training. While tweets are typically multilingual and contain informal language, including emoji and hashtags, most pre-trained language models for Twitter are either monolingual, adapted from other domains rather than trained exclusively on Twitter, or are trained on a limited amount of in-domain Twitter data. We introduce Bernice, the first multilingual RoBERTa language model trained from scratch on 2.5 billion tweets with a custom tweet-focused tokenizer. We evaluate on a variety of monolingual and multilingual Twitter benchmarks, finding that our model consistently exceeds or matches the performance of a variety of models adapted to social media data as well as strong multilingual baselines, despite being trained on less data overall. We posit that it is more efficient compute- and data-wise to train completely on in-domain data with a specialized domain-specific tokenizer.

Original languageEnglish (US)
Pages6191-6205
Number of pages15
StatePublished - 2022
Externally publishedYes
Event2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022 - Abu Dhabi, United Arab Emirates
Duration: Dec 7 2022Dec 11 2022

Conference

Conference2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022
Country/TerritoryUnited Arab Emirates
CityAbu Dhabi
Period12/7/2212/11/22

ASJC Scopus subject areas

  • Computational Theory and Mathematics
  • Computer Science Applications
  • Information Systems

Fingerprint

Dive into the research topics of 'Bernice: A Multilingual Pre-trained Encoder for Twitter'. Together they form a unique fingerprint.

Cite this