NLP Pipeline

large-scale, parallel, multilingual and modularized

Process text in many languages!

Our multilingual NLP Pipeline is based on a flexible API which enables effective end-to-end processing of text in the following languages:

  • Arabic
  • Chinese
  • Dutch
  • English
  • French
  • German
  • Italian
  • Japanese
  • Korean
  • Polish
  • Portuguese
  • Russian
  • Spanish

Multilinguality is a key feature of our pipeline, with most modules available in 13 languages. Moreover, we feature:

  • parallelism of independent modules
  • modularization, with an effective pipeline customized to each specific need
  • large scale, making it possible to process millions of texts in seconds
  • availability both as an online service and as an offline software package


Our multilingual Natural Language Processing pipeline includes modules for the following tasks. Each module can be accessed separately or run as part of the integrated pipeline:

  • Language recognition
  • Tokenization
  • Morphological analysis
  • Part-of-speech tagging
  • Named Entity Recognition
  • Term, concept and entity extraction
  • Domain labeling
  • Tag classification
  • Word Sense Disambiguation and Entity Linking
  • Semantic vector document creation
  • Semantic similarity of sentences, paragraphs and documents
  • Sentiment analysis


Babelscape’s NLP pipeline comes with several groundbreaking features. It is designed to work on a large scale in dozens of languages using the same interface for each language. Users can select only the modules they need and can run dozens of tasks in parallel on the same CPU. The pipeline also integrates our flagship products, WordAtlas, Comprehendo and Extraggo, as modules, enabling a full-fledged analysis of text that ranges from tokenization to semantic analysis and text analytics.
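To make the idea of selecting modules and running independent ones in parallel more concrete, here is a minimal Python sketch. It is not Babelscape's API: the module functions, the AVAILABLE_MODULES registry and run_pipeline are hypothetical stand-ins invented for illustration, and a thread pool is only one possible way to run independent modules side by side.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in modules; these are NOT Babelscape's implementations.
def tokenize(text):
    """Naive whitespace tokenizer."""
    return text.split()

def count_characters(text):
    """Return the number of characters in the text."""
    return len(text)

def capitalized_words(text):
    """Crude placeholder for entity spotting: words starting with an uppercase letter."""
    return [word for word in text.split() if word[:1].isupper()]

# Hypothetical registry mapping a module name to its function.
AVAILABLE_MODULES = {
    "tokens": tokenize,
    "length": count_characters,
    "capitalized": capitalized_words,
}

def run_pipeline(text, selected):
    """Run only the selected modules, submitting each to a worker thread."""
    with ThreadPoolExecutor(max_workers=max(1, len(selected))) as pool:
        futures = {
            name: pool.submit(AVAILABLE_MODULES[name], text) for name in selected
        }
        # Collect each module's output under its module name.
        return {name: future.result() for name, future in futures.items()}

result = run_pipeline("Babelscape processes text in Rome", ["tokens", "capitalized"])
print(result)
```

Because the modules in this sketch do not depend on each other's output, they can be submitted together; modules that build on one another (for example, tagging on top of tokenization) would need to run in sequence instead.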

Product comparison

We compared the time performance of our multilingual pipeline with that of two strong competitors, the Stanford CoreNLP and NLTK libraries, on gold-standard data. The results reported in the table below show that our pipeline is faster and more accurate than its alternatives.

Task                        Babelscape    CoreNLP     NLTK
Language recognition        1.13ms        -           -
Tokenization                0.15ms        0.45ms      4.34ms
Morphological analysis      4.32ms        39.92ms     26.90ms
Part-of-speech tagging      2.56ms        20.09ms     24.24ms
Named Entity Recognition    14.81ms       158.59ms    143.66ms
*Tested on an Intel® Core™ i7-6700HQ with 16 GB of DDR3 RAM.
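
The timings above can be turned into speed-up factors by dividing each competitor's time by the Babelscape time for the same task. The short Python sketch below does this arithmetic using only the figures from the table; the variable and function names are our own, and language recognition is left out because no competitor timings are reported for it.

```python
# Timings in milliseconds, copied from the table: (Babelscape, CoreNLP, NLTK).
timings_ms = {
    "Tokenization": (0.15, 0.45, 4.34),
    "Morphological analysis": (4.32, 39.92, 26.90),
    "Part-of-speech tagging": (2.56, 20.09, 24.24),
    "Named Entity Recognition": (14.81, 158.59, 143.66),
}

def speedups(timings):
    """Return, per task, how many times slower CoreNLP and NLTK are than Babelscape."""
    return {
        task: (corenlp / babelscape, nltk / babelscape)
        for task, (babelscape, corenlp, nltk) in timings.items()
    }

factors = speedups(timings_ms)
for task, (vs_corenlp, vs_nltk) in factors.items():
    print(f"{task}: {vs_corenlp:.1f}x vs CoreNLP, {vs_nltk:.1f}x vs NLTK")
```

On these figures the factors range from about 3x (tokenization against CoreNLP) to about 29x (tokenization against NLTK).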