Golos dataset | Documentation for developers

Golos dataset

Last updated on March 27, 2024


Golos is a Russian corpus suitable for speech research. The dataset mainly consists of recorded audio files manually annotated on a crowd-sourcing platform. The total duration of the audio is about 1240 hours. The corpus is freely available for download, along with an acoustic model trained on it. We also built a 3-gram KenLM language model using the open Common Crawl corpus.

Dataset structure

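Domain   | Train files | Train hours | Test files | Test hours
---------|-------------|-------------|------------|-----------
Crowd    | 979 796     | 1 095       | 9 994      | 11.2
Farfield | 124 003     | 132.4       | 1 916      | 1.4
Total    | 1 103 799   | 1 227.4     | 11 910     | 12.6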

Downloads

Audio files in opus format

Archive        | Size    | Link
---------------|---------|--------------------
golos_opus.tar | 20.5 GB | https://sc.link/JpD

Audio files in wav format

Manifest files with all the training transcriptions are in the train_crowd9.tar archive, listed in the table below:

Archive            | Size    | Link
-------------------|---------|--------------------
train_farfield.tar | 15.4 GB | https://sc.link/1Z3
train_crowd0.tar   | 11 GB   | https://sc.link/Lrg
train_crowd1.tar   | 14 GB   | https://sc.link/MvQ
train_crowd2.tar   | 13.2 GB | https://sc.link/NwL
train_crowd3.tar   | 11.6 GB | https://sc.link/Oxg
train_crowd4.tar   | 15.8 GB | https://sc.link/Pyz
train_crowd5.tar   | 13.1 GB | https://sc.link/Qz7
train_crowd6.tar   | 15.7 GB | https://sc.link/RAL
train_crowd7.tar   | 12.7 GB | https://sc.link/VG5
train_crowd8.tar   | 12.2 GB | https://sc.link/WJW
train_crowd9.tar   | 8.08 GB | https://sc.link/XKk
test.tar           | 1.3 GB  | https://sc.link/Kqr
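
Once an archive is downloaded, the manifest can be read without unpacking the whole tar file. The sketch below is illustrative only: it assumes the manifests are JSON Lines files (one JSON object per line, as is conventional for NeMo) whose names end in `.jsonl` and whose records carry a `duration` field in seconds. None of these details are stated on this page, so check them against the actual archive contents. The function names `iter_manifest` and `total_hours` are hypothetical helpers.

```python
import io
import json
import tarfile


def iter_manifest(tar_path, manifest_suffix=".jsonl"):
    """Yield one dict per manifest line from every manifest file in the archive.

    Assumption: manifests are JSON Lines files whose names end with
    `manifest_suffix`; adjust the suffix if the archive uses another name.
    """
    with tarfile.open(tar_path) as archive:
        for member in archive:
            if not (member.isfile() and member.name.endswith(manifest_suffix)):
                continue
            raw = archive.extractfile(member)
            # Decode the member as UTF-8 text and parse it line by line.
            for line in io.TextIOWrapper(raw, encoding="utf-8"):
                line = line.strip()
                if line:
                    yield json.loads(line)


def total_hours(records, duration_key="duration"):
    """Sum per-utterance durations (assumed to be in seconds) and convert to hours."""
    return sum(float(record[duration_key]) for record in records) / 3600.0


# Example usage (requires the downloaded archive):
# records = list(iter_manifest("train_crowd9.tar"))
# print(len(records), "utterances,", round(total_hours(records), 1), "hours")
```

If the field names match, the resulting totals can be compared with the file and hour counts in the dataset structure table as a quick sanity check of the download.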

Acoustic and language models

The acoustic model is built on the QuartzNet15x5 architecture and trained using the NeMo toolkit.

Three n-gram language models were created using the KenLM Language Model Toolkit.

Archive                  | Size   | Link
-------------------------|--------|--------------------
QuartzNet15x5_golos.nemo | 68 MB  | https://sc.link/ZMv
KenLMs.tar               | 4.8 GB | https://sc.link/YL0

Golos data and models are also available in DataHub ML Space, a hub of pre-trained models, datasets, and containers. You can train the model and deploy it on the high-performance SberCloud infrastructure in ML Space, a full-cycle machine learning development platform for data science team collaboration based on the Christofari supercomputer.

Evaluation

Word Error Rate (WER), in percent, for different decoders and test sets.

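Decoder / Test set                            | Crowd test | Farfield test | MCV dev | MCV test
----------------------------------------------|------------|---------------|---------|---------
Greedy decoder                                | 4.389%     | 14.949%       | 9.314%  | 11.278%
Beam Search with Common Crawl LM              | 4.709%     | 12.503%       | 6.341%  | 7.976%
Beam Search with Golos train set LM           | 3.548%     | 12.384%       | -       | -
Beam Search with Common Crawl and Golos LM    | 3.318%     | 11.488%       | 6.4%    | 8.06%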

MCV stands for Mozilla Common Voice, Mozilla's initiative to help teach machines how real people speak.
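
For readers who want to score their own transcripts against the test sets, the following is a minimal sketch of the standard corpus-level WER definition: the word-level edit distance (substitutions, insertions, and deletions) summed over all utterances, divided by the total number of reference words. This page does not say how the reported figures were computed or what text normalization was applied, so results from this sketch may not match the table exactly. The function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, ref_token in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_token in enumerate(hyp, start=1):
            cost = 0 if ref_token == hyp_token else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def word_error_rate(references, hypotheses):
    """Corpus-level WER in percent over paired reference and hypothesis strings."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    errors = 0
    words = 0
    for ref, hyp in zip(references, hypotheses):
        ref_words = ref.split()
        errors += edit_distance(ref_words, hyp.split())
        words += len(ref_words)
    if words == 0:
        raise ValueError("references contain no words")
    return 100.0 * errors / words


# Example usage with hypothetical transcripts:
# word_error_rate(["включи свет на кухне"], ["включи свет в кухне"])
```

Summing errors over the whole corpus before dividing weights long utterances more than short ones; averaging per-utterance WER instead would give a different number.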

Resources

Golos: Russian Dataset for Speech Research.

License

Public license.

Creators

  • Aleksandr Denisenko.
  • Angelina Kovalenko.
  • Nikolaj Karpov.
  • Fedor Min'kin.

Contacts

Send your questions by email to SmartSpeech@sberbank.ru.
