Compilation of Sindhi Corpora Initiated

On April 25, 2022, the Abdul Majid Bhurgri Institute of Language Engineering began the data processing phase of its effort to collect and compile Sindhi corpora. Our initial efforts focused on downloading text from websites such as Daily Awami Awaz, Sindh Salamat Forum, and Sindhi Adabi Board, gathering research papers from the Sindhi Language Authority, and scraping data from Facebook and Twitter.

After processing all the collected data, we have compiled a total of 152,000 entries. To further enrich this dataset, we have invited Sindhi social media users to create accounts on the Sindhi Corpus Portal and contribute words from their own dialects and folk language that are not yet included in the dataset.

This initiative aims to create a comprehensive collection of Sindhi words, which will serve as a vital resource for the upcoming Sindhi WordNet System, a key project currently in the Institute’s pipeline.
