Towards Sindhi Corpus Construction

  • Mutee U Rahman Department of Computer Science, Isra University - Hyderabad, Pakistan
Keywords: corpus construction, unigram, bigram, trigram frequencies, orthography, script

Abstract

This paper describes the current state of Sindhi corpus construction. It covers corpus development issues, including corpus acquisition, preprocessing, and tokenization, and presents preliminary results and observations: letter unigram, bigram, and trigram frequencies, word frequencies, and word bigram frequencies. It also discusses the limitations of the current corpus and future work, and examines the orthography and script of the Sindhi language with reference to corpus development.
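
As a rough illustration of the frequency counts named in the abstract, the following Python sketch tallies letter and word n-grams. It is not the author's method: the function names `letter_ngrams` and `word_ngrams` are hypothetical, the whitespace tokenization is a simplifying assumption (the paper treats Sindhi tokenization as an open issue), and the sample string is a Latin-letter placeholder rather than corpus data. Letters are counted as Unicode code points, so combining marks in Arabic-script text would be counted separately.

```python
from collections import Counter


def letter_ngrams(text, n):
    """Count letter n-grams inside each whitespace-delimited token.

    Assumption: tokens are split on whitespace, and n-grams do not
    cross token boundaries.
    """
    counts = Counter()
    for token in text.split():
        for i in range(len(token) - n + 1):
            counts[token[i:i + n]] += 1
    return counts


def word_ngrams(text, n):
    """Count word n-grams (as tuples) over whitespace-delimited tokens."""
    tokens = text.split()
    return Counter(
        tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )


# Placeholder input, not taken from the Sindhi corpus.
sample = "aba ab aba"

letter_unigrams = letter_ngrams(sample, 1)
letter_bigrams = letter_ngrams(sample, 2)
letter_trigrams = letter_ngrams(sample, 3)
word_unigrams = word_ngrams(sample, 1)
word_bigrams = word_ngrams(sample, 2)

# The three most frequent letter bigrams with their counts.
top_bigrams = letter_bigrams.most_common(3)
```

A real pipeline would likely replace `text.split()` with a tokenizer that handles the script-specific issues the paper raises, and would normalize the text before counting.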

Published
2015-03-31
How to Cite
Mutee U Rahman. (2015). Towards Sindhi Corpus Construction. Linguistics and Literature Review, 1(1), 39-48. https://doi.org/10.32350/llr/11/04
Section
Articles