How to Start a Language on Mozilla Common Voice?
A case study for under-resourced Turkish Language
- Track: Mozilla devroom
- Room: D.mozilla
- Day: Saturday
- Start: 14:00
- End: 14:45

As of December 2021, Mozilla Common Voice lists 154 locales, but only 87 of them have fulfilled the requirements to start collecting voices, and 27 of those are fairly new. In this two-part presentation, we offer some starting points for new language communities and share the knowledge we have accumulated over the last year while working on the under-resourced Turkish language, along with our initial training results.
The presentation covers the following topics: resources on Mozilla Common Voice, how to analyze your dataset, how to set goals, how to design a social media campaign, which tools you can use, Google Colab notebooks, Coqui STT, and our roundups on training with the Common Voice Turkish dataset from v1 to v7.0. Throughout, we present our successes and failures as the Common Voice Turkish Volunteers group as lessons learned.
- Errata: In the video, "checkpoint" is mistakenly written and spoken as "breakpoint"; this has been corrected in the slides.
- Addendum: Our dataset analysis and training results for the Common Voice v8.0 dataset have been added as new slides and video.
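
As a rough illustration of the "how to analyze your dataset" topic, the sketch below counts clips, distinct speakers, and the gender breakdown in a Common Voice style TSV file. It is not taken from the talk or its notebooks: the function name `summarize_validated` is hypothetical, the sample rows are made up, and the column names `client_id` and `gender` are assumed to match the `validated.tsv` file shipped with Common Voice releases, so check them against your own download.

```python
import csv
import io
from collections import Counter


def summarize_validated(tsv_file):
    """Summarize a Common Voice style TSV given as an open text file.

    Assumes (not confirmed by the talk) that the file has a header row
    with at least a `client_id` column and an optional `gender` column.
    """
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    clips_per_speaker = Counter()
    clips_per_gender = Counter()
    for row in reader:
        clips_per_speaker[row["client_id"]] += 1
        # Demographic fields are optional, so empty values are common.
        clips_per_gender[row.get("gender") or "unspecified"] += 1
    total = sum(clips_per_speaker.values())
    # Share of all clips recorded by the single most active speaker.
    top_share = max(clips_per_speaker.values()) / total if total else 0.0
    return {
        "clips": total,
        "speakers": len(clips_per_speaker),
        "clips_per_gender": dict(clips_per_gender),
        "top_speaker_share": top_share,
    }


# Tiny made-up sample standing in for a real validated.tsv file.
SAMPLE = (
    "client_id\tpath\tsentence\tgender\n"
    "a1\tclip1.mp3\tMerhaba dünya\tfemale\n"
    "a1\tclip2.mp3\tGünaydın\tfemale\n"
    "b2\tclip3.mp3\tNasılsın\t\n"
    "c3\tclip4.mp3\tTeşekkürler\tmale\n"
)

summary = summarize_validated(io.StringIO(SAMPLE))
print(summary)
```

For a real dataset, the same function can be given a file opened with `open(path, encoding="utf-8", newline="")`. A high `top_speaker_share` is one simple signal that a small number of voices dominate the recordings, which matters when setting goals for a new language community.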
Speakers
- Bülent Özden
Links
- Mozilla Common Voice
- CV - Sentence Collector
- CV - Discourse
- CV - Pontoon (UI translation)
- CV - Turkish sub-Discourse
- CV - Matrix chat
- CV - GitHub repositories
- CV - Community Playbook
- Common Voice Utils repo
- Common Voice Docker repo
- Coqui Website
- GitHub repo of the Colab notebooks used in the experiments presented
- Video recording (WebM/VP9)
- Video recording (mp4)