Unlocking Sindhi Language Data

An invitation to NLP researchers and developers: Sindhi language resources are now open for AI and NLP innovation, giving developers the data access they need to advance Sindhi language technology.

Empowering Sindhi Language Innovation: Open Data, AI Collaboration, and Linguistic Advancement

Open Access to Sindhi Language Data

To offer researchers and developers free access to a rich repository of Sindhi language datasets, enabling advanced AI, NLP, and digital tool development for linguistic empowerment.

Advancement of Sindhi Language Technologies

To facilitate the growth of Sindhi-focused technologies like speech recognition, OCR, and machine translation by providing curated corpora, grammar-tagged datasets, and structured linguistic resources to developers and institutions.

Collaboration for WordNet and Linguistic AI Tools

To invite partnerships for the creation of Sindhi WordNet and AI-based applications through the use of annotated words, thesauri, and advanced semantic tagging of Sindhi vocabulary and syntax.

Sindhi Language Resources Open for Developers and Researchers

The Abdul Majid Bhurgri Institute of Language Engineering, a pioneering institute under the Culture, Tourism, Antiquities & Archives Department, Government of Sindh, is proud to announce that we have successfully compiled and structured one of the largest Sindhi language corpora and AI-ready resources for public benefit and linguistic advancement.

We now invite NLP engineers, AI researchers, linguists, and software developers working on Sindhi language technology to collaborate with us and access this valuable data free of cost, under specific terms and conditions that ensure responsible and impactful use.

Why Collaborate with Us?

  • Use data to build Text-to-Speech (TTS) and Speech-to-Text (STT) systems

  • Train language models, chatbots, OCR systems, and translators

  • Help bridge the digital divide for Sindhi language users worldwide

  • Develop academic research and open-source tools for Sindhi language processing

Our Sindhi Language Data Highlights

Sindhi language tokens collected from diverse sources

extracted segments, with 2 Million already cleaned and structured

Sindhi audio data curated for speech technology research

images with associated text data for robust OCR training, capable of handling large PDF conversions

words grammar tagged, with ongoing annotation for: Gender, Number, Tense, Synonym & Antonym, Hypernym & Hyponym

Preprocessed and Cleaned Sentence Pairs Now Available for Fine-Tuning in Sindhi Language Model Training & Development.

Shah Jo Risalo Corpus & Sindhi POS-Tagged Word Dataset

Access a rich and open-access Sindhi language resource featuring two key datasets:

(1) Shah Jo Risalo – the poetry of Shah Abdul Latif Bhittai, meticulously structured into 43,779 entries

(2) A Sindhi POS-tagged word dataset containing over 162,000 words in CSV format. Curated for NLP and AI developers, these resources are openly available on GitHub, Hugging Face, Kaggle, GitLab, and the Internet Archive, enabling global advancement in Sindhi computational linguistics.
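As a starting point for working with a POS-tagged word list in CSV format, the sketch below counts how often each tag occurs. The column names `word` and `pos`, the tag labels, and the three sample rows are assumptions made for illustration; they are not taken from the published dataset, so check the actual file header before adapting this.

```python
import csv
import io
from collections import Counter

# Hypothetical sample data: the header names ("word", "pos"), the tag labels,
# and these rows are illustrative assumptions, not the published dataset.
SAMPLE_CSV = """word,pos
ڪتاب,NOUN
پاڻي,NOUN
لکڻ,VERB
"""


def count_pos_tags(csv_file, pos_col="pos"):
    """Count the occurrences of each POS tag in an open CSV file object."""
    reader = csv.DictReader(csv_file)
    counts = Counter()
    for row in reader:
        # Skip rows where the tag column is missing or empty.
        tag = (row.get(pos_col) or "").strip()
        if tag:
            counts[tag] += 1
    return counts


# Read from an in-memory string so the example runs without any download.
tag_counts = count_pos_tags(io.StringIO(SAMPLE_CSV))
print(tag_counts.most_common())
```

For a downloaded file, the same function can be called on `open(path, encoding="utf-8", newline="")`, where `path` is wherever the CSV was saved; reading as UTF-8 matters because the words are in Sindhi script.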

Hugging Face Datasets

Harvard University Dataverse

GitHub Repositories

Kaggle (Google Datasets)

Updated Linguistic Resource Metrics – 28th August, 2025

For individuals: Apply, Collaborate & Contribute to Sindhi Language Innovation

Interested in Accessing the Data?

AMBILE invites AI developers, researchers, academic institutions, and open-source collaborators to access its extensive Sindhi language resources. To obtain the data, submit a brief proposal outlining your intended use. Access is free under a non-commercial open data license with a signed usage agreement to ensure responsible use.

How to Apply for Access

Email your request to contact@ambile.pk or submit the online form below. Please include your name or organization, the purpose of use, a short project summary, and any relevant background. Our team will review your request and respond with the next steps.

Terms and Conditions

Data is provided free for non-commercial research and educational use. Proper acknowledgment of AMBILE is required. Redistribution or resale is prohibited. The institute reserves the right to assess all requests for appropriateness and compliance.

The Institute was established under the umbrella of the Culture Department, Government of Sindh, for the development and integration of Sindhi language technology, ensuring the language's preservation and promotion in the digital age. The Institute aims to lead in research, innovation, and the creation of linguistic resources, fostering a technologically empowered future for the Sindhi language.

Contact us

info@ambile.pk

+92 (22) 924-0290

Behind Sindh Museum
N-5, Hyderabad – Sindh