Golos dataset

Last updated on March 27, 2024

You can read the Russian version of the documentation.

Golos is a Russian corpus suitable for speech research. The dataset mainly consists of recorded audio files manually annotated on a crowd-sourcing platform. The total duration of the audio is about 1240 hours. We have made the corpus freely available for download, along with an acoustic model trained on it. We have also built a 3-gram KenLM language model using the open Common Crawl corpus.

Dataset structure

Domain	Train files	Train hours	Test files	Test hours
Crowd	979 796	1 095	9 994	11.2
Farfield	124 003	132.4	1 916	1.4
Total	1 103 799	1 227.4	11 910	12.6
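The figure of about 1240 hours is the sum of the train and test hours in the table. A short check of that arithmetic, with the values copied from the table (the variable names are only illustrative):

```python
# Hours per domain, copied from the dataset structure table.
train_hours = {"Crowd": 1095.0, "Farfield": 132.4}
test_hours = {"Crowd": 11.2, "Farfield": 1.4}

# Round to one decimal place to hide floating-point noise.
total_train = round(sum(train_hours.values()), 1)
total_test = round(sum(test_hours.values()), 1)
total = round(total_train + total_test, 1)

print(f"train: {total_train} h, test: {total_test} h, total: {total} h")
```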

Downloads

Audio files in opus format

Archive	Size	Link
golos_opus.tar	20.5 GB	https://sc.link/JpD
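A minimal sketch of fetching and unpacking the archive with the Python standard library. The helper names `download` and `extract` are hypothetical, and the code assumes the short link redirects to the archive file itself:

```python
import tarfile
import urllib.request
from pathlib import Path

# Archive name and link copied from the table above.
OPUS_ARCHIVE = ("golos_opus.tar", "https://sc.link/JpD")


def download(url: str, dest: Path) -> Path:
    """Fetch url into dest, skipping the download if dest already exists."""
    if not dest.exists():
        dest.parent.mkdir(parents=True, exist_ok=True)
        # Assumption: the short link redirects to the archive itself.
        urllib.request.urlretrieve(url, str(dest))
    return dest


def extract(archive: Path, out_dir: Path) -> list:
    """Unpack a tar archive into out_dir and return the member names."""
    out_dir.mkdir(parents=True, exist_ok=True)
    root = out_dir.resolve()
    with tarfile.open(archive) as tar:
        members = tar.getmembers()
        for member in members:
            # Refuse entries that would land outside out_dir.
            target = (root / member.name).resolve()
            if target != root and root not in target.parents:
                raise ValueError(f"unsafe path in archive: {member.name}")
        tar.extractall(out_dir)
    return [member.name for member in members]
```

With these helpers, `extract(download(OPUS_ARCHIVE[1], Path("data") / OPUS_ARCHIVE[0]), Path("data/golos_opus"))` would download about 20.5 GB and unpack it, so make sure there is enough free disk space first.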

Audio files in wav format

Manifest files with the transcriptions for all the training data are in the train_crowd9.tar archive listed in the table:

Archive	Size	Link
train_farfield.tar	15.4 GB	https://sc.link/1Z3
train_crowd0.tar	11 GB	https://sc.link/Lrg
train_crowd1.tar	14 GB	https://sc.link/MvQ
train_crowd2.tar	13.2 GB	https://sc.link/NwL
train_crowd3.tar	11.6 GB	https://sc.link/Oxg
train_crowd4.tar	15.8 GB	https://sc.link/Pyz
train_crowd5.tar	13.1 GB	https://sc.link/Qz7
train_crowd6.tar	15.7 GB	https://sc.link/RAL
train_crowd7.tar	12.7 GB	https://sc.link/VG5
train_crowd8.tar	12.2 GB	https://sc.link/WJW
train_crowd9.tar	8.08 GB	https://sc.link/XKk
test.tar	1.3 GB	https://sc.link/Kqr
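The manifest layout is not described on this page. The sketch below assumes NeMo-style JSON Lines manifests, where each line is a JSON object with `audio_filepath`, `text`, and `duration` (in seconds) fields. These field names are assumptions to verify against the files in train_crowd9.tar, and the helper names are hypothetical:

```python
import json
from typing import Iterable, Iterator


def read_manifest(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one record per non-empty JSON line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def total_hours(entries: Iterable[dict]) -> float:
    """Sum the assumed 'duration' field (seconds) and convert to hours."""
    return sum(float(entry["duration"]) for entry in entries) / 3600.0


# Two made-up records in the assumed format, for illustration only.
sample = [
    '{"audio_filepath": "a.wav", "text": "пример", "duration": 1800.0}',
    '{"audio_filepath": "b.wav", "text": "текст", "duration": 1800.0}',
]
records = list(read_manifest(sample))
print(len(records), total_hours(records))
```

To process a real manifest, pass an open file object instead of the sample list, for example `read_manifest(open(path, encoding="utf-8"))`.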

Acoustic and language models

The acoustic model is built on the QuartzNet15x5 architecture and trained with the NeMo toolkit.

Three n-gram language models were created with the KenLM Language Model Toolkit:

LM built on Common Crawl Russian dataset.
LM built on Golos train set.
LM built on Common Crawl and Golos datasets together (50/50).

Archive	Size	Link
QuartzNet15x5_golos.nemo	68 MB	https://sc.link/ZMv
KenLMs.tar	4.8 GB	https://sc.link/YL0
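A minimal sketch of loading the downloaded checkpoint with NeMo and transcribing audio files, assuming the `nemo_toolkit[asr]` package is installed. QuartzNet is a CTC model, so the sketch restores it through `EncDecCTCModel`. The helper names are hypothetical, and the return type of `transcribe` differs between NeMo versions, which the second helper accounts for:

```python
def load_acoustic_model(nemo_path: str):
    """Restore the CTC acoustic model from a .nemo checkpoint."""
    # Imported lazily because NeMo is a heavy optional dependency.
    import nemo.collections.asr as nemo_asr

    return nemo_asr.models.EncDecCTCModel.restore_from(restore_path=nemo_path)


def transcribe(model, wav_paths):
    """Return one transcript string per audio file."""
    results = model.transcribe(wav_paths)
    # Depending on the NeMo version, items are plain strings or
    # hypothesis objects that carry the transcript in a .text attribute.
    return [r if isinstance(r, str) else r.text for r in results]
```

Typical use would be `model = load_acoustic_model("QuartzNet15x5_golos.nemo")` followed by `transcribe(model, ["example.wav"])`, where `example.wav` is a placeholder for your own recording. This produces greedy-decoded text. Beam search with one of the KenLM models needs a separate decoder, which is not shown here.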

Golos data and models are also available in DataHub ML Space, a hub of pre-trained models, datasets, and containers. You can train the model and deploy it on the high-performance SberCloud infrastructure in ML Space, a full-cycle machine learning development platform for data science team collaboration, based on the Christofari supercomputer.

Evaluation

Word Error Rate (WER), in percent, on different test sets.

Decoder / Test set	Crowd test	Farfield test	MCV dev	MCV test
Greedy decoder	4.389%	14.949%	9.314%	11.278%
Beam Search with Common Crawl LM	4.709%	12.503%	6.341%	7.976%
Beam Search with Golos train set LM	3.548%	12.384%	—	—
Beam Search with Common Crawl and Golos LM	3.318%	11.488%	6.4%	8.06%

MCV stands for Mozilla Common Voice, Mozilla's initiative to help teach machines how real people speak.
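For reference, WER is the word-level edit distance (substitutions, deletions, and insertions) between the recognized text and the reference, divided by the number of reference words. The sketch below computes it over a set of utterances. It splits on whitespace and applies no text normalization, which may differ from the exact procedure used to produce the table:

```python
def edit_distance(ref: list, hyp: list) -> int:
    """Levenshtein distance between two word sequences."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # match or substitution
            ))
        prev = cur
    return prev[-1]


def wer_percent(pairs) -> float:
    """Corpus-level WER in percent over (reference, hypothesis) string pairs."""
    errors = 0
    ref_words = 0
    for reference, hypothesis in pairs:
        ref = reference.split()
        errors += edit_distance(ref, hypothesis.split())
        ref_words += len(ref)
    return 100.0 * errors / ref_words


# One substitution and one deletion against four reference words.
print(wer_percent([("a b c d", "a x c")]))  # prints 50.0
```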

Resources

Golos: Russian Dataset for Speech Research.

License

Public license.

Creators

Aleksandr Denisenko.
Angelina Kovalenko.
Nikolaj Karpov.
Fedor Min'kin.

Contacts

Send your questions by email to SmartSpeech@sberbank.ru.
