Abstract
Modern Large Language Models (LLMs) are commonly trained through a multi-stage pipeline encompassing pretraining and supervised finetuning. While recent studies have extensively investigated the benefits of continual pretraining on high-quality data, these efforts have focused primarily on English. In this work, we explore the effectiveness of various data mixtures in a continual pretraining setting to enhance performance on Italian-language tasks. Leveraging Minerva-7B, a fully open-source LLM pretrained on a corpus composed of 50% Italian, we define and evaluate three distinct data recipes, comprising mathematical, encyclopedic, and copyrighted content, spanning both Italian and English. We also investigate the effect of extending the model's context window during continual pretraining on its ability to handle long-context tasks. To support our evaluation, we introduce INDAQA, a new benchmark for narrative question answering in Italian. Our results reveal that both data composition and increased context length substantially improve performance, offering valuable insights into continual pretraining strategies for less represented languages within an open scientific framework.
- Luca Moroni, Tommaso Bonomo, Luca Gioffré, Lu Xu, Domenico Fedele, Leonardo Colosi, Andrei Stefan Bejgu, Alessandro Scirè, Roberto Navigli. 2025. What We Learned from Continually Training Minerva: A Case Study on Italian. In Proceedings of the Ninth Italian Conference on Computational Linguistics (CLiC-it 2025), Cagliari, Italy. CLiC-it.