Bolbosh: Script-Aware Flow Matching for Kashmiri Text-to-Speech

Abstract

Kashmiri is spoken by around 7 million people but remains critically underserved in speech technology, despite its official status and rich linguistic heritage. The lack of robust Text-to-Speech (TTS) systems limits digital accessibility and inclusive human-computer interaction for native speakers. In this work, we present the first dedicated, open-source neural TTS system designed for Kashmiri. We show that zero-shot multilingual baselines trained for Indic languages fail to produce intelligible speech, achieving a Mean Opinion Score (MOS) of only 1.86, largely due to inadequate modeling of Perso-Arabic diacritics and language-specific phonotactics. To address these limitations, we propose Bolbosh, a supervised cross-lingual adaptation strategy based on Optimal Transport Conditional Flow Matching (OT-CFM) within the Matcha-TTS framework, which enables stable alignment under limited paired data. We further introduce a three-stage acoustic enhancement pipeline consisting of dereverberation, silence trimming, and loudness normalization to unify heterogeneous speech sources and stabilize alignment learning. The model's vocabulary is expanded to explicitly encode Kashmiri graphemes, preserving fine-grained vowel distinctions. Our system achieves a MOS of 3.63 and a Mel-Cepstral Distortion (MCD) of 3.73, substantially outperforming multilingual baselines and establishing a new benchmark for Kashmiri speech synthesis. Our results demonstrate that script-aware modeling and supervised flow-based adaptation are critical for low-resource TTS in diacritic-sensitive languages.

Audio Samples

These audio samples were synthesized by Bolbosh using our two high-quality Rasa speakers. Mel-spectrograms were generated with a Conditional Flow Matching (CFM) decoder and converted to waveforms with a HiFi-GAN vocoder.

Male Speaker (Speaker ID: 422)

مےٛ چھُ آز سُلۍ وۄستادَس نِش گَژھُن، سُہ پرناوِ مےٛ آز بَدل کانٛہہ سبَق۔

بہٕ ییٛلہِ آزٕ صُبحٲے دراس، اَتہِ ٲسۍ واریاہ لُکۍ درامٕتۍ چَکرَس۔

مےٛ رۄٛن پنٕنہِ وارِ ہُنٛد حاکھ ۔ أتھ گوو سیٹھا مزٕ ژٕتہِ وُچھُس نوٗن کٮ۪وٛتھ چُھس۔

سُہ گوو ٲدۍ کانٛدرَس نِش ژۄٹ اننہٕ۔ واریاہ کالہٕ گووُس گمتٕس مَگر وٕنہِ آو نہٕ کینٛہہ۔

سۄ سَمکھٔے نہٕ تتہِ کہیٖنۍ۔ سُۄ ٲسۍ گٔمٕژ ماتامال راتس روزنہٕ خٲطرٕ۔

گوٗر ییہِ صُبہٕچہِ دہہِ بجہٕ دۄدھ ہیتھ ۔ تٔمِس چھُ آسان کٔمٕے دۄدھ، یِنہٕ ژے دۄدھ رٹُن مشِتھ گژھی۔

بہٕ گووُس راتھ پنٕنِس باغس منٛز۔ تتہِ کھیُو مےٛ صُبحٲے أکھ ژوٗنٹ پتہٕ کھیم شامس زٕ ژوٗنٹۍ۔

سٲنِس علاقس منٛزبنٲوِکھ زٕ نٔے سکوٗل تہٕ بیٚیہِ أکھ کالیج کینٛہہ وقھت برونٛٹھ ۔

آز چھُ یتھ اوبُر تہٕ تٖرتہِ چھِ واریاہ زیادٕ بہٕ گژھنہٕ امہِ موٗجوٗب آز نیبر کینٛہہ۔

تاکھچَس پٮ۪ٹھ تھٔومٕتِس یِتھ سامانس چٕھ واریاہ گرٕد کھٔژمٕژ، أتھ دِتہٕ ہایہِ ژٕنڈ۔

Female Speaker (Speaker ID: 423)

میانہٕ بینہٕ چھُ لۄکٕٹۍ لۄکِٹۍ شُرۍ، سُۄ چھِ تِمن سۭتۍ دۄہس آوٕر آسان۔

دۄہَس اُوسُس بہٕ کٲم کران کارخانَس منٛزٕ۔ بہٕ تھۄکُس سٮ۪ٹھا۔ وونۍ کَڑٕ بہٕ تَھکھ تامَتھ ۔

مےٛ دیُوت تٔم کرِہُن شال کلس دِنہٕ خٲطرٕ، مگر مےٛ گوو سُہ خبَر کیوٚتھ۔ بہٕ لاگنہٕ سُہ کینہہ۔

ژےٛ گومُت چھُے پران پران أچھَن وۄزجار۔ یِم چھَلُکھ تٕرنہِ آبہٕ سۭتۍ یِم گژھنۓ ٹھیٖکھ ۔

بہٕ اوسُس تٕمن دۄشوِنی دارِپیٹھ بِہِتھ تامشہٕ وُچھان زٕ یِم ہیکنہٕ یمہٕ سارۓ کٔنہِ تُلِتھ ۔

پتِمہِ ریتہٕ اوس واریاہس کالس روٗد پیوان، تہٕ شیٖن تہِ پیو ورایاہ، مگریمہٕ ریتہٕ پیو نہٕ وٕنہِ کینٛہہ۔

راتھ شامَس اوس کُس تام نفر میٲنِس مٲلِس ژھانڈان، بہٕ زانہن نہٕ کینٛہہ سُہ کُس اوس۔

دَپان چھٕ شامَس گژھِ بتہٕ سُلی سُلی کھیون، امہِ سۭتۍ چھُ زُو ٹھیٖکھ روزان۔

تٔمِس اوس اوترٕ کُس تام پرژھان زِ سُہ کتہِ چھُ روزان۔

راتَس دۄہَس چھٕ یمِ شُرۍ آسان گِندان تہٕ نَژان، تۄتہِ چھےٛ نہٕ أمہٕ تَھکان۔

Model Architecture Overview

Bolbosh replaces standard phonetic processing with robust character-level representations suitable for Kashmiri orthography, utilizing the Conditional Flow Matching (CFM) objective proposed in Matcha-TTS.

Key Components

Text Encoder: 6-layer Transformer with Rotary Positional Embeddings (RoPE), 2 attention heads, 192 hidden channels.
Text Processing: Custom KashmiriNormalizer handling Unicode canonicalization, diacritic preservation, and Plat Ye corrections.
Vocabulary: 272 character tokens bypassing any phonetic tool (phonemizer-free).
Duration Predictor: 2-layer CNN estimating character timing.
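
The character-level front end described above can be illustrated with a minimal sketch. This is not the project's KashmiriNormalizer: the function names `normalize_text`, `build_vocab`, and `encode` are hypothetical, the choice of NFC as the canonical form is an assumption, and script-specific corrections are omitted. It only shows how Unicode canonicalization can keep diacritics intact while mapping text straight to character IDs without a phonemizer.

```python
import unicodedata


def normalize_text(text: str) -> str:
    """Canonicalize Unicode while keeping diacritics (assumed: NFC form)."""
    # NFC composes a base letter and combining mark into a precomposed
    # character where one exists; it does not remove diacritics.
    text = unicodedata.normalize("NFC", text)
    # Collapse runs of whitespace into single spaces.
    return " ".join(text.split())


def build_vocab(corpus):
    """Map every character seen in the corpus to an integer ID."""
    symbols = sorted({ch for line in corpus for ch in normalize_text(line)})
    # ID 0 is left free for padding (an assumption of this sketch).
    return {ch: idx for idx, ch in enumerate(symbols, start=1)}


def encode(text, vocab):
    """Turn text into character IDs, skipping characters not in the vocab."""
    return [vocab[ch] for ch in normalize_text(text) if ch in vocab]


if __name__ == "__main__":
    corpus = ["مےٛ چھُ آز", "بہٕ گووُس راتھ"]
    vocab = build_vocab(corpus)
    print(len(vocab), encode(corpus[0], vocab))
```

Because normalization runs both when the vocabulary is built and when text is encoded, two visually identical strings with different underlying code point sequences map to the same IDs.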

Synthesis Engine

Flow Matching Decoder: 1-D U-Net using SnakeBeta activation, predicting the optimal vector field.
Speaker Conditioning: 64-dimensional embeddings (learned jointly over 424 speakers).
Vocoder: HiFi-GAN universal v1 reconstructing waveforms from 80-bin mel-spectrograms.
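
To make the flow matching objective concrete, here is a minimal pure-Python sketch of the OT-CFM training target as formulated in Matcha-TTS, plus a fixed-step Euler sampler. It is an illustration under stated assumptions, not the Bolbosh training code: the value of `SIGMA_MIN`, the function names, and the use of plain lists in place of mel-spectrogram tensors are all choices made for this example, and the text and speaker conditioning of the real decoder is left out.

```python
import random

SIGMA_MIN = 1e-4  # assumed value for this sketch


def cfm_training_pair(x1, t, sigma_min=SIGMA_MIN, rng=random):
    """Build the noisy input x_t and the target vector field u_t for OT-CFM.

    x1 is the data sample (for TTS, a flattened mel frame); x0 is Gaussian noise.
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]
    scale = 1.0 - (1.0 - sigma_min) * t
    # Straight-line interpolation between noise and data.
    x_t = [scale * a + t * b for a, b in zip(x0, x1)]
    # The time derivative of x_t, which the decoder is trained to predict.
    u_t = [b - (1.0 - sigma_min) * a for a, b in zip(x0, x1)]
    return x_t, u_t


def cfm_loss(predicted, target):
    """Mean squared error between predicted and target vector fields."""
    return sum((p - q) ** 2 for p, q in zip(predicted, target)) / len(target)


def euler_sample(vector_field, dim, n_steps=10, rng=random):
    """Integrate dx/dt = v(x, t) from t=0 (noise) to t=1 with Euler steps."""
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    dt = 1.0 / n_steps
    for i in range(n_steps):
        v = vector_field(x, i * dt)
        x = [a + dt * b for a, b in zip(x, v)]
    return x


if __name__ == "__main__":
    x_t, u_t = cfm_training_pair([1.0, -2.0, 0.5], t=0.3)
    print(cfm_loss([0.0, 0.0, 0.0], u_t))
```

In a real system the `vector_field` callable would be the trained U-Net, and the number of Euler steps trades synthesis speed against quality.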

Evaluation Results

Mean Opinion Score (MOS) Results for Kashmiri TTS Systems with 95% Confidence Intervals.

System	MOS (↑)	95% CI
Human (Ground Truth)	4.614	± 0.059
Bolbosh (Ours)	3.634	± 0.061
IndicParler (Baseline)	1.864	± 0.065
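
The page does not say how the 95% confidence intervals were computed. One common approach, shown here only as an assumption, is a normal approximation over the individual listener ratings: the mean plus or minus 1.96 standard errors. The function name `mos_with_ci` and the ratings in the example are hypothetical and are not data from the study.

```python
import math
import statistics


def mos_with_ci(ratings, z=1.96):
    """Return (mean, half_width) of a normal-approximation confidence interval."""
    mean = statistics.fmean(ratings)
    # Standard error of the mean, using the sample standard deviation.
    sem = statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, z * sem


if __name__ == "__main__":
    # Hypothetical 1-5 ratings, for illustration only.
    toy_ratings = [4, 3, 4, 5, 3, 4, 4, 3]
    mean, half_width = mos_with_ci(toy_ratings)
    print(f"MOS = {mean:.3f} ± {half_width:.3f}")
```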

TTS Benchmark Results. Lower is better (↓) for MCD, rWER, and WER. Because the ASR model itself has a high WER (IndicConformer; WER: 41.2% without diacritics, 66.59% with diacritics), we also report rWER, the word error rate relative to that of the ASR model. rWER is given in percent; WER is given as a fraction.

Model	Condition	MCD	rWER (%)	WER
Bolbosh	With Diacritics	3.73	4.14	0.6935
Bolbosh	No Diacritics	–	13.23	0.4665
IndicParler	With Diacritics	4.73	46.75	0.9772
IndicParler	No Diacritics	–	100.32	0.8253
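
The page does not state the rWER formula. The reported values are, however, consistent with the relative increase of the TTS WER over the ASR model's own WER, that is rWER = 100 × (WER_TTS − WER_ASR) / WER_ASR. The sketch below uses that inferred definition (an assumption, as are the names `relative_wer`, `ASR_WER`, and `TTS_WER`) and recomputes the rWER column from the WER values in the table.

```python
# ASR (IndicConformer) WER on reference speech, in percent, from the caption.
ASR_WER = {"With Diacritics": 66.59, "No Diacritics": 41.2}

# TTS WER from the table, converted from fractions to percent.
TTS_WER = {
    ("Bolbosh", "With Diacritics"): 69.35,
    ("Bolbosh", "No Diacritics"): 46.65,
    ("IndicParler", "With Diacritics"): 97.72,
    ("IndicParler", "No Diacritics"): 82.53,
}


def relative_wer(tts_wer, asr_wer):
    """Relative WER in percent (definition inferred from the reported values)."""
    return 100.0 * (tts_wer - asr_wer) / asr_wer


if __name__ == "__main__":
    for (model, condition), wer in TTS_WER.items():
        rwer = relative_wer(wer, ASR_WER[condition])
        print(f"{model:12s} {condition:16s} rWER = {rwer:.2f}%")
```

Under this reading, an rWER near 0% means the synthesized speech is recognized about as well as human speech, while a value above 100% means the error rate more than doubled.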

Mel-Spectrogram Comparisons

Comparison of Mel-Spectrograms between Ground Truth (Human), Bolbosh (Ours), and IndicParler.

BibTeX

@inproceedings{ashraf2026bolbosh,
  title={Bolbosh: Script-Aware Flow Matching for Kashmiri Text-to-Speech},
  author={Ashraf, Tajamul and Zargar, Burhaan Rasheed and Muizz, Saeed Abdul and Mushtaq, Ifrah and Mehdi, Nazima and Gillani, Iqra Altaf and Kak, Aadil Amin and Bashir, Janibul},
  booktitle={Proceedings of Interspeech},
  year={2026},
  url={https://github.com/gaash-lab/Bolbosh}
}