TY - GEN
T1 - What Makes Data-to-Text Generation Hard for Pretrained Language Models?
AU - Keymanesh, Moniba
AU - Benton, Adrian
AU - Dredze, Mark
N1 - Publisher Copyright:
© 2022 Association for Computational Linguistics.
PY - 2022
Y1 - 2022
N2 - Expressing natural language descriptions of structured facts or relations - data-to-text generation (D2T) - increases the accessibility of structured knowledge repositories. Previous work (Nan et al., 2020) shows that pre-trained language models (PLMs) perform remarkably well on this task after fine-tuning on a significant amount of task-specific training data. On the other hand, while auto-regressive PLMs can generalize from a few task examples, their efficacy at D2T is largely unexplored. Furthermore, we have an incomplete understanding of the limits of PLMs on D2T. In this work, we conduct an empirical study of both fine-tuned and auto-regressive PLMs on the DART multi-domain D2T dataset. We consider their performance as a function of the amount of task-specific data and of how the data is incorporated into the models: zero- and few-shot learning, and fine-tuning of model weights. In addition, we probe the limits of PLMs by measuring performance on subsets of the evaluation data: novel predicates and abstractive test examples. To improve performance on these subsets, we investigate two techniques: providing predicate descriptions in the context and re-ranking generated candidates by information reflected in the source. Finally, we conduct a human evaluation of model errors and show that D2T generation tasks would benefit from datasets with more careful manual curation.
AB - Expressing natural language descriptions of structured facts or relations - data-to-text generation (D2T) - increases the accessibility of structured knowledge repositories. Previous work (Nan et al., 2020) shows that pre-trained language models (PLMs) perform remarkably well on this task after fine-tuning on a significant amount of task-specific training data. On the other hand, while auto-regressive PLMs can generalize from a few task examples, their efficacy at D2T is largely unexplored. Furthermore, we have an incomplete understanding of the limits of PLMs on D2T. In this work, we conduct an empirical study of both fine-tuned and auto-regressive PLMs on the DART multi-domain D2T dataset. We consider their performance as a function of the amount of task-specific data and of how the data is incorporated into the models: zero- and few-shot learning, and fine-tuning of model weights. In addition, we probe the limits of PLMs by measuring performance on subsets of the evaluation data: novel predicates and abstractive test examples. To improve performance on these subsets, we investigate two techniques: providing predicate descriptions in the context and re-ranking generated candidates by information reflected in the source. Finally, we conduct a human evaluation of model errors and show that D2T generation tasks would benefit from datasets with more careful manual curation.
UR - http://www.scopus.com/inward/record.url?scp=85152950694&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85152950694&partnerID=8YFLogxK
U2 - 10.18653/v1/2022.gem-1.50
DO - 10.18653/v1/2022.gem-1.50
M3 - Conference contribution
AN - SCOPUS:85152950694
T3 - GEM 2022 - 2nd Workshop on Natural Language Generation, Evaluation, and Metrics, Proceedings of the Workshop
SP - 539
EP - 554
BT - GEM 2022 - 2nd Workshop on Natural Language Generation, Evaluation, and Metrics, Proceedings of the Workshop
PB - Association for Computational Linguistics (ACL)
T2 - 2nd Workshop on Natural Language Generation, Evaluation, and Metrics, GEM 2022, as part of EMNLP 2022
Y2 - 7 December 2022
ER -