Header shape illustration 1Header shape illustration 2
Back

What We Learned from Continually Training Minerva: A Case Study on Italian

Luca Moroni, Tommas Bonomo, Luca Gioffré, Lu Xu, Domenico Fedele, Leonardo Colosi, Andrei Stefan Bejgu, Alessandro Scirè, Roberto Navigli

Abstract

Modern Large Language Models (LLMs) are commonly trained through a multi-stage pipeline encompassing pretraining and supervised finetuning. While recent studies have extensively investigated the benefits of continual pretraining on high-quality data, these efforts have focused primarily on English. In this work, we explore the effectiveness of various data mixtures in a continual pretraining setting to enhance performance on Italian-language tasks. Leveraging Minerva-7B, a fully opensource LLM pretrained on a corpus composed of 50% Italian, we define and evaluate three distinct data recipes-comprising mathematical, encyclopedic, and copyrighted content-spanning both Italian and English. We also investigate the effect of extending the model’s context window during continual pretraining on its ability to handle long-context tasks. To support our evaluation, we introduce INDAQA, a new benchmark for narrative question answering in Italian. Our results reveal that both data composition and increased context length substantially improve performance, offering valuable insights into continual pretraining strategies for less represented languages within an open scientific framework.

Your privacy choices

Save and continue
Sign up!
The best way to get the latest news from Babelscape and the NLP world!
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Thank you for subscribing!
You’ve been added to our mailing list, and you’ll receive our next newsletter to stay updated on the latest news from the NLP world!
Something went wrong
We are sorry, your request cannot be processed right now.
Please wait a bit and try again.
Unsubscribe
We're sorry to see you go. Please enter your email address to complete the unsubscription process.
You'll receive an email confirmation shortly.
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Check your email
We have sent you a link to your email to complete the unsubscribe process.
Something went wrong
We are sorry, your request cannot be processed right now.
Please wait a bit and try again.