Abstract
Modern Large Language Models (LLMs) are commonly trained through a multi-stage pipeline encompassing pretraining and supervised finetuning. While recent studies have extensively investigated the benefits of continual pretraining on high-quality data, these efforts have focused primarily on English. In this work, we explore the effectiveness of various data mixtures in a continual pretraining setting to enhance performance on Italian-language tasks. Leveraging Minerva-7B, a fully open-source LLM pretrained on a corpus composed of 50% Italian, we define and evaluate three distinct data recipes, comprising mathematical, encyclopedic, and copyrighted content, spanning both Italian and English. We also investigate the effect of extending the model's context window during continual pretraining on its ability to handle long-context tasks. To support our evaluation, we introduce INDAQA, a new benchmark for narrative question answering in Italian. Our results reveal that both data composition and increased context length substantially improve performance, offering valuable insights into continual pretraining strategies for less represented languages within an open scientific framework.
- Luca Moroni, Tommaso Bonomo, Luca Gioffré, Lu Xu, Domenico Fedele, Leonardo Colosi, Andrei Stefan Bejgu, Alessandro Scirè, Roberto Navigli. 2025. What We Learned from Continually Training Minerva: A Case Study on Italian. In Proceedings of the Ninth Italian Conference on Computational Linguistics (CLiC-it 2025), Cagliari, Italy. CLiC-it.