Saraiki Language Hybrid Stemmer Using Rule-Based and LSTM-Based Sequence-To-Sequence Model Approach

  • Mubasher H. Malik Department of Computer Science, Institute of Southern Punjab, Multan, Pakistan
  • Hamid Ghous Australian Scientific & Engineering Solutions, Sydney, New South Wales, Australia
  • Iqra Ahsan Department of Computer Science, Institute of Southern Punjab, Multan, Pakistan
  • Maryem Ismail Department of Computer Science, Institute of Southern Punjab, Multan, Pakistan
Keywords: Hybrid Stemmer, LSTM, Rule-based Stemmer, Saraiki, Stemming

Abstract

Stemming, the conversion of a word to its root form, is an essential step in natural language processing (NLP) and an integral part of the linguistic pre-processing in NLP applications. It reduces inflected word forms to their root word. Much work has been done on stemming for national and regional languages such as English, French, Arabic, German, Urdu, and Hindi, but many regional languages still lack digital resources for NLP. Saraiki is one of the most widely spoken regional languages in Pakistan; almost eighty million people use it for communication. Very few digital resources are available in Saraiki to support advances in NLP technologies. This research proposes a hybrid stemmer for Saraiki words. The hybrid stemmer combines two hundred prefix and postfix rules with a long short-term memory (LSTM) based sequence-to-sequence model that converts Saraiki words into their stems. First, the Saraiki text was pre-processed and the rule set was implemented. Second, the LSTM-based sequence-to-sequence model was deployed to stem Saraiki words correctly. Finally, the stemmer's performance was evaluated by measuring the accuracy of the stems produced by the rule set and by the LSTM-based sequence-to-sequence model. In the experiments, the rule set stemmed words correctly with 68.53% accuracy, while the LSTM-based sequence-to-sequence model achieved 93.0% accuracy. This work contributes significantly to regional linguistics by introducing a stemmer for the Saraiki language.
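
The rule-based stage and the accuracy evaluation described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the affix lists are made-up Latin-letter placeholders rather than the paper's two hundred Saraiki rules, the names `rule_stem`, `hybrid_stem`, and `accuracy` are hypothetical, and the fallback from rules to a learned model is only one plausible way to combine the two stages, which the abstract does not specify. The `model_stem` argument stands in for a trained sequence-to-sequence model.

```python
from typing import Callable, Iterable, Optional, Tuple

# Hypothetical placeholder affixes (NOT the paper's Saraiki rule set).
PREFIXES: Tuple[str, ...] = ("un", "re")
SUFFIXES: Tuple[str, ...] = ("ing", "ed", "s")
MIN_STEM_LEN = 3  # assumed guard so very short words are left alone


def rule_stem(word: str) -> Optional[str]:
    """Strip at most one prefix and one suffix, longest match first.

    Returns None when no rule applies, so a caller can fall back
    to another stemmer.
    """
    stem = word
    for prefix in sorted(PREFIXES, key=len, reverse=True):
        if stem.startswith(prefix) and len(stem) - len(prefix) >= MIN_STEM_LEN:
            stem = stem[len(prefix):]
            break
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if stem.endswith(suffix) and len(stem) - len(suffix) >= MIN_STEM_LEN:
            stem = stem[:-len(suffix)]
            break
    return stem if stem != word else None


def hybrid_stem(word: str, model_stem: Callable[[str], str]) -> str:
    """Use the rule set when it fires, otherwise defer to the model.

    model_stem is a stand-in for a trained sequence-to-sequence model.
    """
    ruled = rule_stem(word)
    return ruled if ruled is not None else model_stem(word)


def accuracy(pairs: Iterable[Tuple[str, str]],
             stemmer: Callable[[str], str]) -> float:
    """Percentage of (word, gold_stem) pairs the stemmer gets right."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    correct = sum(1 for word, gold in pairs if stemmer(word) == gold)
    return 100.0 * correct / len(pairs)
```

With an identity function standing in for the model, `accuracy(pairs, lambda w: hybrid_stem(w, lambda x: x))` scores the hybrid on a list of (word, gold stem) pairs, which mirrors the way the abstract reports the percentage of correctly stemmed words.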

Published
2022-12-25
How to Cite
Mubasher H. Malik, Hamid Ghous, Iqra Ahsan, & Maryem Ismail. (2022). Saraiki Language Hybrid Stemmer Using Rule-Based and LSTM-Based Sequence-To-Sequence Model Approach. Innovative Computing Review, 2(2). https://doi.org/10.32350/icr.0202.02