Determining Urdu News Type from Headline Text Using Deep Learning
Abstract
In recent years, the volume of regional-language data available on the Internet has grown significantly, helping people express themselves across linguistic boundaries. Moreover, news articles on the web give billions of users a source of knowledge. This research presents a model for classifying Urdu news headlines using deep learning (DL) techniques and different word vector embeddings. To improve the efficacy of various Urdu natural language processing (NLP) applications, the study includes two neural word embeddings built with the most widely used approaches, namely Word2vec and pre-trained fastText. Both intrinsic and extrinsic evaluation methods were used to examine the quality of these embeddings. The study employed a large, new corpus of Urdu text containing 153,050 headlines categorized into 8 classes. Text pre-processing techniques and two DL models, namely Long Short-Term Memory (LSTM) and Bidirectional Long Short-Term Memory (BiLSTM), were then applied, and the results were compared across embeddings. With the pre-trained fastText embedding, BiLSTM outperformed the other DL models, achieving an accuracy of 93.93%, precision of 93.86%, recall of 93.93%, and F1 score of 93.89%.
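The four reported metrics can be illustrated with a small sketch. The abstract does not state how precision, recall, and F1 were averaged over the 8 classes. The code below assumes support-weighted averaging, which is one common choice and under which recall always equals accuracy, consistent with the identical 93.93% figures reported. The function name `weighted_metrics` and the toy labels are hypothetical and are not taken from the study.

```python
from collections import Counter


def weighted_metrics(y_true, y_pred):
    """Return accuracy and support-weighted precision, recall, and F1.

    Assumption: per-class scores are weighted by the number of true
    examples in each class; the paper's averaging scheme is not stated.
    """
    n = len(y_true)
    support = Counter(y_true)      # true examples per class
    predicted = Counter(y_pred)    # predictions per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    accuracy = sum(correct.values()) / n
    precision = recall = f1 = 0.0
    for label, count in support.items():
        # Per-class precision, recall, and F1, guarding against zero division.
        p = correct[label] / predicted[label] if predicted[label] else 0.0
        r = correct[label] / count
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        weight = count / n
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    return accuracy, precision, recall, f1


# Hypothetical toy labels standing in for headline categories.
y_true = ["sports", "sports", "business", "world", "world", "world"]
y_pred = ["sports", "business", "business", "world", "world", "sports"]
acc, prec, rec, f1 = weighted_metrics(y_true, y_pred)
print(f"accuracy={acc:.4f} precision={prec:.4f} recall={rec:.4f} f1={f1:.4f}")
```

Under macro averaging, by contrast, recall and accuracy would generally differ on an imbalanced corpus.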
Copyright (c) 2023 Umair Arshad, Khawar Iqbal Malik, Hira Arooj, Muhammad Fiaz
This work is licensed under a Creative Commons Attribution 4.0 International License.
UMT-AIR follows an open-access publishing policy, and the full text of all published articles is available free immediately upon publication of an issue. The journal’s contents are published and distributed under the terms of the Creative Commons Attribution 4.0 International (CC-BY 4.0) license. Submission of a work to the journal therefore implies that it is the original, unpublished work of the authors (neither published previously nor accepted or under consideration for publication elsewhere). On acceptance of a manuscript for publication, the corresponding author will sign and submit a completed Copyright and Author Consent Form on behalf of all co-authors.