Towards Sindhi Corpus Construction

Authors

  • Mutee U Rahman, Department of Computer Science, Isra University, Hyderabad, Pakistan

DOI:

https://doi.org/10.32350/llr/11/04

Keywords:

corpus construction, unigram, bigram, trigram frequencies, orthography, script

Abstract

This paper describes the current state of Sindhi corpus construction. It covers corpus development issues, including corpus acquisition, preprocessing, and tokenization. Preliminary results and observations are presented: letter unigram, bigram, and trigram frequencies, word frequencies, and word bigram frequencies. The limitations of the current corpus and directions for future work are also discussed. The paper also explores the orthography and script of the Sindhi language with reference to corpus development.
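
The frequency measures named in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the whitespace tokenizer, the function names, and the placeholder sample text are all assumptions made for illustration. Real Sindhi text in Perso-Arabic script would likely need the preprocessing and tokenization steps the paper discusses, and a Unicode code point does not always correspond to one letter.

```python
from collections import Counter

def tokenize(text):
    # Assumption: tokens are separated by whitespace. Sindhi text may need
    # a script-aware tokenizer; this is only a stand-in.
    return text.split()

def letter_ngrams(tokens, n):
    # Count character n-grams inside each token (no n-gram spans two tokens).
    counts = Counter()
    for tok in tokens:
        for i in range(len(tok) - n + 1):
            counts[tok[i:i + n]] += 1
    return counts

def word_bigrams(tokens):
    # Count pairs of adjacent tokens.
    return Counter(zip(tokens, tokens[1:]))

# Placeholder sample, not taken from any Sindhi corpus.
sample = "aba ab aba"
tokens = tokenize(sample)

letter_unigram_freq = letter_ngrams(tokens, 1)
letter_bigram_freq = letter_ngrams(tokens, 2)
letter_trigram_freq = letter_ngrams(tokens, 3)
word_freq = Counter(tokens)
word_bigram_freq = word_bigrams(tokens)

print(letter_unigram_freq.most_common())
print(word_bigram_freq.most_common())
```

Counting n-grams within tokens rather than across the whole text is a design choice of this sketch; counting across word boundaries would give different bigram and trigram totals.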


Published

2015-03-31

How to Cite

Mutee U Rahman. (2015). Towards Sindhi Corpus Construction. Linguistics and Literature Review, 1(1), 39–48. https://doi.org/10.32350/llr/11/04

Issue

Section

Articles
