First Urdu language model bert based by Urduhack team

1 minute read


First Urdu language model bert based by Urduhack team.


Welcome to the world of Language models. Idea of transfer learning in NLP took the world by storm when google introduced it’s amazing language model called BERT. BERT set new State-of-the-art(SOTA) on many NLP benchmarks.

BERTs for different languages like chinese, french and german etc inspired us to put forward our efforts in pre-training a BERT like language model for urdu.

Introducing you to roberta-urdu-small. This is first language model for urdu language and we hope to release bigger and better models in future with different architectures like BERT/ALBERT etc. We have embarked upon this journey of language modelling and we hope to make a great contribution. We would love to see support and contribution from Machine Learning community.

Training Data

Huge amount of data is a pre-requisite for pre-training a language model. Data for urdu language is available mainly in news. We scraped data for the last 10 years from Pakistan’s popular urdu news papers and other resources also include.


We opt for roberta architecture instead of BERT for its better accuracy and a better pre-training approach. Roberta architecture is same like BERT except that it is just pre-trained on masked-language-mode (MLM) task instead of both MLM and next-sentence-prediction (NSP) task. NSP task does not contribute much to model performance on downstream tasks and removing it reduces computation complexity and training time.

Model was pre-trained for many epochs on P-100 Tesla GPU. It took long time to Train.


Let’s test our model on mlm task. In this task we randomly mask a word from our input sentence and the model predicts the masked word.

from transformers import pipeline

fill_mask = pipeline("fill-mask", model="roberta-urdu-small", tokenizer="roberta-urdu-small")

results = fill_mask("حکومت کی <mask> رنگ لے آئیں") #masked_word = کوششیں
# {'sequence': '<s>حکومت کی کوششیں رنگ لے آئیں</s>', 'score': 0.07726366072893143, 'token': 4356}
# {'sequence': '<s>حکومت کی خدمات رنگ لے آئیں</s>', 'score': 0.05506495386362076, 'token': 2284}

Model contributor

  • (NLP library for Urdu language)

Leave a Comment