Saraiki Language Hybrid Stemmer Using Rule-Based and LSTM-Based Sequence-To-Sequence Model Approach

Stemming, the process of reducing a word to its root form, is an important task in natural language processing (NLP) and an integral part of the linguistic pre-processing in NLP applications. It converts inflected word forms into their root word. Much stemming work has been done for national and regional languages such as English, French, Arabic, German, Urdu, and Hindi, but many regional languages still lack digital resources built with NLP. Saraiki is one of the widely spoken regional languages of Pakistan, used by almost eighty million people for communication, yet very few digital resources are available in Saraiki to support advances in NLP technologies. This research proposes a hybrid stemmer for Saraiki words. The hybrid stemmer combines two hundred prefix and postfix rules with a long short-term memory (LSTM) based sequence-to-sequence model that converts Saraiki words into their stems. First, the Saraiki text was pre-processed and the rule set was implemented. Second, the LSTM-based sequence-to-sequence model was deployed to stem Saraiki words. Finally, the stemmer's performance was evaluated by measuring the accuracy of correctly stemmed words for both the rule set and the LSTM-based sequence-to-sequence model. In the experiments, the rule set achieved 68.53% accuracy, while the LSTM-based sequence-to-sequence model achieved 93.0% accuracy. This work contributes to the regional linguistic field by introducing a stemmer for the Saraiki language.
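The rule-based stage and the accuracy measure described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the affix lists are English placeholders standing in for the two hundred Saraiki rules (which the abstract does not list), and the names `rule_stem`, `stem_accuracy`, and `MIN_STEM_LEN`, as well as the longest-match-first and minimum-length choices, are assumptions made for illustration.

```python
from typing import Callable, Iterable, Sequence, Tuple

# Placeholder affixes for illustration only; these are NOT the paper's Saraiki rules.
PREFIXES: Sequence[str] = ("un", "re")
SUFFIXES: Sequence[str] = ("ing", "ed", "s")
MIN_STEM_LEN = 3  # assumed guard so that short words are not over-stripped


def rule_stem(word: str) -> str:
    """Strip at most one prefix and one suffix, trying longer affixes first."""
    stem = word
    for prefix in sorted(PREFIXES, key=len, reverse=True):
        if stem.startswith(prefix) and len(stem) - len(prefix) >= MIN_STEM_LEN:
            stem = stem[len(prefix):]
            break
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if stem.endswith(suffix) and len(stem) - len(suffix) >= MIN_STEM_LEN:
            stem = stem[: -len(suffix)]
            break
    return stem


def stem_accuracy(
    stemmer: Callable[[str], str],
    gold_pairs: Iterable[Tuple[str, str]],
) -> float:
    """Fraction of (word, gold_stem) pairs for which the stemmer returns gold_stem."""
    pairs = list(gold_pairs)
    if not pairs:
        return 0.0
    correct = sum(1 for word, gold in pairs if stemmer(word) == gold)
    return correct / len(pairs)


if __name__ == "__main__":
    gold = [("walked", "walk"), ("replaying", "play"), ("ran", "run")]
    print(rule_stem("replaying"))
    print(stem_accuracy(rule_stem, gold))
```

Because `stem_accuracy` accepts any callable that maps a word to a stem, the same exact-match measure could in principle score a trained sequence-to-sequence model alongside the rule set; how the paper's evaluation was actually implemented is not stated in the abstract.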
Copyright (c) 2022 Mubasher H. Malik, Hamid Ghous, Iqra Ahsan, Maryem Ismail

This work is licensed under a Creative Commons Attribution 4.0 International License.