Determining Urdu News Type from Headline Text Using Deep Learning

  • Umair Arshad, School of Computing, Robert Gordon University, Aberdeen, United Kingdom
  • Khawar Iqbal Malik, Riphah School of Computing & Innovation, Riphah International University, Lahore Campus, Pakistan
  • Hira Arooj, Department of Mathematics and Statistics, University of Lahore, Sargodha Campus, Pakistan
  • Muhammad Fiaz, Department of Computer Science, University of Lahore, Sargodha Campus, Pakistan
Keywords: deep learning (DL), fastText, natural language processing (NLP), Urdu news classification, Word2vec


In recent years, the volume of regional-language data available on the Internet has grown significantly, helping people express themselves across linguistic boundaries. Moreover, the availability of news articles on the web provides billions of users with a source of knowledge. This research presents a classification model for categorizing Urdu news headline text using deep learning (DL) techniques and different word vector embeddings. To improve the efficacy of various Urdu natural language processing (NLP) applications, this study used two neural word embeddings built with the most widely used approaches, namely Word2vec and pre-trained fastText. Both intrinsic and extrinsic evaluation methods were used to examine the quality of the resulting neural word embeddings. The study employed a large, new corpus of Urdu text containing 153,050 headlines categorized into 8 classes. Text pre-processing techniques and two DL models, namely Long Short-Term Memory (LSTM) and Bidirectional Long Short-Term Memory (BiLSTM), were then applied, and the results were compared across embeddings. With the pre-trained fastText embedding, BiLSTM surpassed the other DL models, achieving an accuracy of 93.93%, precision of 93.86%, recall of 93.93%, and F1 score of 93.89%.
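The reported recall is identical to the reported accuracy (93.93%). Support-weighted averaging over classes always yields this equality, but the abstract does not state which averaging scheme was used, so weighted averaging is an assumption here. The following minimal sketch shows how such multi-class metrics could be computed; the function name `weighted_metrics` and the toy labels are hypothetical and are not taken from the study's corpus.

```python
from collections import Counter

def weighted_metrics(y_true, y_pred):
    """Return accuracy and support-weighted precision, recall, and F1."""
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)      # true examples per class
    predicted = Counter(y_pred)    # predictions per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    n = len(y_true)

    precision = recall = f1 = 0.0
    for label in labels:
        tp = correct[label]
        p = tp / predicted[label] if predicted[label] else 0.0
        r = tp / support[label] if support[label] else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        w = support[label] / n     # weight each class by its share of examples
        precision += w * p
        recall += w * r
        f1 += w * f

    accuracy = sum(correct.values()) / n
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Toy labels for illustration only (assumed class names, not the paper's data).
y_true = ["sports", "sports", "business", "world", "world", "world"]
y_pred = ["sports", "business", "business", "world", "world", "sports"]
metrics = weighted_metrics(y_true, y_pred)
```

On the toy labels, weighted recall and accuracy coincide because each class's recall is weighted by its support, so the sum reduces to the fraction of correct predictions.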


