machine translation using hugging face pipelines

Part 1 : Evolution of Machine Translation (MT) Systems
Part 2 : Pretrained Models for Neural Machine Translation (NMT) Systems
Part 3 : Neural Machine Translation (NMT) using Hugging Face Pipeline
Part 4 : Neural Machine Translation (NMT) using EasyNMT library

💥 Why Hugging Face pipeline?

Once a pretrained language is fine-tuned for machine translation(MT), the model can be used for inference, i.e., we can use the fine-tuned model to translate the text sequences. MT model inference can be done in two ways, namely

without using pipelines by writing code for each step of model inference
using pipeline, which abstracts all the code required for each step of model inference.

We can observe that without using a pipeline, we have to write additional lines of code

#1 tokenizing the source text sequence
input = tokenizer(src_seq)
#2 generating the target sequence ids
tgt_seq_ids = model.generate(**input)
#3 generating the target sequence by decoding target sequence ids
generating the target sequence by decoding target sequence ids

Using a pipeline, we can do model inference in just a few lines of code.

#1 creating pipeline object
translate = pipeline('translation')
#2 get the target sequence by passing source sequence to pipeline object
tgt_seq = translate(src_seq)

Hugging Face Pipeline object abstracts the three steps of model inference, namely tokenization, generating target sequence ids, and decoding target sequence ids. So, when using a pipeline, we need not write any additional lines of code for these three steps.

💥 Hugging Face pipeline template for NMT inference

Now we will see the basic template for using Hugging Face pipeline for NMT inference.

Step 1 : Set the device to GPU if you want. By default, the pipeline uses CPU only.

#set the device
device = torch.cuda.current_device() if torch.cuda.is_available() else -1

Step 2 : Load the tokenizer and fine-tuned model using AutoTokenizer and AutoModelForSeqtoSeqLM classes from the transformers library.

#load the model
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
#load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name,src_lang, tgt_lang)

Step 3 : Create a pipeline object by passing the phrase “translation” along with the tokenizer and model objects.

#create the pipeline object
translator = pipeline('translation', model=model, 
tokenizer = tokenizer,src_lang="hi", tgt_lang="en",device=device)

Step 4 : Get the target sequence by passing the source sequence to the pipeline object. Along with the source sequence, we can also pass other arguments like max_length, min_length, etc. which control the generation of the target sequence.

#get the target sentence
target_seq = translator(text, max_length=128)

We have to keep two things in mind namely

💡 While using SDMT models (Ex: OPUS MT) for inference, we need not specify source and target language ids while loading the tokenizer or creating the pipeline object. This is because these models can translate the data in only in direction.

💡 While using MDMT models (Ex:mBART50-OM, mBART50-MO, mBART50-MM, NLLB200, M2M100), we need to explicitly specify source and target language ids while loading the tokenizer and creating the pipeline object. This is because these models can translate the data in multiple directions. Without specifying the language ids, the model is unable to know in which direction the translation has to be done. In the case of these models, if language ids are not specified, an error will be thrown.

💥 NMT inference using various pretrained models

Now let us see how to use Hugging Face pipeline for MT inference using various models like OPUS-MT, mBART50-MO, mBART50-MM, M2M100, and NLLB200.

🔥 Install and import libraries

First download the necessary the libraries like transformers, sentencepiece and sacremoses.

!pip install git+https://github.com/huggingface/transformers -q
!pip install  sentencepiece sacremoses -q

Import the necessary libraries and classes

from transformers import pipeline, AutoTokenizer, AutoModelForSeq2SeqLM
import torch

Here will consider the following sentence in the Hindi language. We will use all these models to translate it to English and check how well these models can translate.

text = "राहुल को उनकी शानदार बल्लेबाजी और क्षेत्ररक्षण के लिए मैन ऑफ द मैच का पुरस्कार दिया गया।"

🔥 OPUS-MT

As we want to translate a sentence from Hindi to English, we have to use the OPUS-MT model (Helsinki-NLP/opus-mt-hi-en) with Hindi (hi) as the source language and English (en) as the target language.

#set the device
device = torch.cuda.current_device() if torch.cuda.is_available() else -1

#load the model
model_name = 'Helsinki-NLP/opus-mt-hi-en'
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

#load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

#create pipeline object
translator = pipeline('translation', model=model, tokenizer=tokenizer,device=device)

#get the translated sentence
summary = translator(text,max_length=128)

#display the translated sentence
print(summary[0]['translation_text'])
# output : Rahul was given the award of the Man of the Match to save his grand bat and territory.

Here we can observe that we don't specify source and language ids while loading the tokenizer or creating the pipeline object. This is because OPUS-MT models are SDMT (single-direction machine translation) which can translate data in one direction only.

🔥 mBART-MO

Here we want to translate a sentence from Hindi to English. mBART50-MO model can translate a sentence from any of the supported languages to English. So we can use the model facebook/mbart-large-50-many-to-one-mmt to translate the sentence from Hindi to English.

#set the device
device = torch.cuda.current_device() if torch.cuda.is_available() else -1

#load the model
model_name = 'facebook/mbart-large-50-many-to-one-mmt'
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

#load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang="hi_IN", tgt_lang="en_XX")

#create the pipeline object
translator = pipeline('translation_XX_to_YY', model=model, tokenizer=tokenizer,src_lang="hi_IN", tgt_lang="en_XX",device=device) 

#get the target sequence
target_seq = translator(text, max_length=128)

#display the target sequence
print(target_seq[0]['translation_text'].strip('YY '))
#output: Rahul was awarded the Man of the Match award for his outstanding batting and fielding.

Here we can observe that we specify source and language ids while loading the tokenizer and creating the pipeline object. This is because mBART models are MDMT (multi-direction machine translation) models which can translate data in multiple directions. Without specifying source and language ids in the case of MDMT models will result in an error.

🔥 mBART-MM

Here we want to translate a sentence from Hindi to English. mBART50-MO model can translate sentences from any of the supported languages to any of the supported languages. So we can use the model facebook/mbart-large-50-many-to-many-mmt to translate the sentence from Hindi to English.

#set the device
device = torch.cuda.current_device() if torch.cuda.is_available() else -1

#load the model
model_name = 'facebook/mbart-large-50-many-to-many-mmt'
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

#load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang="hi_IN", tgt_lang="en_XX")

#create the pipeline object
translator = pipeline('translation_XX_to_YY', model=model, tokenizer=tokenizer,src_lang="hi_IN", tgt_lang="en_XX",device=device)

#get the target sentence
target_seq = translator(text, max_length=128)

#display the target sentence
print(target_seq[0]['translation_text'].strip('YY '))
#output : Rahul was awarded Man of the Match for his outstanding batting and fielding.

Here we also can observe that we specify source and language ids while loading the tokenizer and creating the pipeline object. This is because mBART models are MDMT (multi-direction machine translation) models which can translate data in multiple directions. Without specifying source and language ids in the case of MDMT models will result in an error.

🔥 M2M100-418M

Here we want to translate a sentence from Hindi to English. M2M100 model can translate sentences from any of the supported languages to any of the supported languages. So we can use the model facebook/m2m100_418M to translate the sentence from Hindi to English.

#set the device
device = torch.cuda.current_device() if torch.cuda.is_available() else -1

#load the model
model_name = 'facebook/m2m100_418M'
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

#load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang="hi", tgt_lang="en")

#create the pipeline object
translator = pipeline('translation', model=model, tokenizer=tokenizer,src_lang="hi", tgt_lang="en",device=device)   

#get the target sentence
target_seq = translator(text, max_length=128)

#display the target sentence
print(target_seq [0]['translation_text'])
#output : Raul was awarded the Man of the Match Award for his brilliant balloons and territory protection.

Here we also can observe that we specify source and language ids while loading the tokenizer and creating the pipeline object. This is because M2M100 is an MDMT (multi-direction machine translation) model which can translate data in multiple directions. Without specifying source and language ids in the case of MDMT models will result in error.

🔥 M2M100-1.2B

Here we want to translate a sentence from Hindi to English. M2M100 model can translate sentence from any of the supported languages to any of the supported languages. So we can use the model facebook/m2m100_1.2B to translate the sentence from Hindi to English.

#Set the device
device = torch.cuda.current_device() if torch.cuda.is_available() else -1

#Load the model
model_name = 'facebook/m2m100_1.2B'
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

#Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang="hi", tgt_lang="en")

#Create pipeline object
translator = pipeline('translation', model=model, tokenizer=tokenizer,src_lang="hi", tgt_lang="en",device=device) 

#Translate source text sequence
target_seq = translator(text, max_length=128)
print(target_seq [0]['translation_text'])
#output: Raúl was awarded the Man of the Match Award for his brilliant battle and territory protection.

🔥 NLLB200-600M

Here we want to translate a sentence from Hindi to English. NLLB200 model can translate sentences from any of the supported languages to any of the supported languages. So we can use the model facebook/nllb-200-distilled-600M to translate the sentence from Hindi to English.

#set the device
device = torch.cuda.current_device() if torch.cuda.is_available() else -1

#load the model
model_name = 'facebook/nllb-200-distilled-600M'
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

#load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang="hin_Deva", tgt_lang="eng_Latn")

#create the pipeline object
translator = pipeline('translation', model=model, tokenizer=tokenizer,src_lang="hin_Deva", tgt_lang="eng_Latn",device=device)  

#get the target sentence
target_seq = translator(text, max_length=128)

#display the target sentence
print(target_seq [0]['translation_text'])
#output : Rahul was awarded Man of the Match for his outstanding batting and fielding.

Here we also can observe that we specify source and language ids while loading the tokenizer and creating the pipeline object. This is because NLLB200 is an MDMT multi-directionn machine translation) model which can translate data in multiple directions. Without specifying source and language ids in the case of MDMT models will resultan in error.

🔥 NLLB200-1.3B

Here we want to translate a sentence from Hindi to English. NLLB200 model can translate sentences from any of the supported languages to any of the supported languages. So we can use the model facebook/nllb-200-distilled-1.3B to translate the sentence from Hindi to English.

#Set the device
device = torch.cuda.current_device() if torch.cuda.is_available() else -1

#Load the model
model_name = 'facebook/nllb-200-distilled-1.3B'
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

#Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang="hin_Deva", tgt_lang="eng_Latn")

#Create pipeline object
translator = pipeline('translation', model=model, tokenizer=tokenizer,src_lang="hin_Deva", tgt_lang="eng_Latn",device=device) 

#Translate source text sequence
target_seq = translator(text, max_length=128)
print(target_seq [0]['translation_text'])
#output: Rahul was awarded Man of the Match for his brilliant batting and fielding.

Here we also can observe that we specify source and language ids while loading the tokenizer and creating the pipeline object. This is because NLLB200 is an MDMT (multi-direction machine translation) model which can translate data in multiple directions. Without specifying source and language ids in the case of MDMT models will result in an error.

🔥 Discussion

Here we can observe that mBART-50 based models and NLLB200 models are able to generate translated sequences that are close to the original target sequence.

This may be because mBART-50 models are fine-tuned on English-centric machine translation datasets, so they can do better translations that involve the English language.
Though NLLB200 models are not English-centric, as these models are trained on comparatively large volumes of parallel data, the models are able to translate well.

Neural Machine Translation using Hugging Face Pipeline