The Malromur Corpus





The Malromur corpus is an open source corpus of Icelandic voice samples.



Use Malromur [1]

Malromur - Recordings (WAV - 9.8 GB). [256 kbps, 16 kHz] CC BY 4.0 license
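
The stated format (256 kbps at 16 kHz) works out to 16 bits per sample frame, which would be consistent with 16-bit mono PCM. The channel count and sample width are an inference and are not stated on this page. The sketch below uses Python's standard `wave` module to read these properties from a WAV file; it runs on a synthetic in-memory file, and `describe_wav` and `make_test_wav` are hypothetical helper names.

```python
import io
import wave


def describe_wav(source):
    """Return (sample_rate_hz, channels, bits_per_sample, bitrate_kbps)."""
    with wave.open(source, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        bits = wav.getsampwidth() * 8
    return rate, channels, bits, rate * channels * bits / 1000


def make_test_wav(seconds=1, rate=16000):
    """Build an in-memory 16-bit mono WAV of silence (stand-in for a real recording)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)   # assumption: mono
        wav.setsampwidth(2)   # assumption: 2 bytes = 16 bits per sample
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * rate * seconds)
    buf.seek(0)
    return buf


info = describe_wav(make_test_wav())
print(info)
```

To inspect a downloaded recording instead, pass its path to `describe_wav` in place of the synthetic buffer.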
About the Malromur corpus
Reykjavík University and the Icelandic Centre for Language Technology collected data for an Icelandic speech corpus in collaboration with Google. The data is freely available on this webpage and offers a good opportunity to develop language technology tools for Icelandic, such as a speech recognizer. Voice samples from 563 individuals were recorded with Android G1 smartphones, for a total of 152 hours of speech. In total, 127,286 voice samples were recorded. Of those, 108,568 were considered useful and 18,718 were discarded. The 108,568 good voice samples can be downloaded from this webpage.

About the project
The Almannarómur project ran during 2011 and 2012. At that time, voice samples were being collected for various languages under the auspices of Google, in order to develop speech recognition and to make the data available for research and for the development of language technology tools. The goal of the Almannarómur project was to develop a database of spoken sentences to aid the development of automatic speech recognition for Icelandic. The database can also be used in the development of many other types of spoken language technologies.

Google cooperated with Reykjavík University and the Icelandic Centre for Language Technology in collecting voice samples for Icelandic. During the first phase of the project, a text corpus of sentences was generated. About 50% of the text in the corpus is news stories from the website mbl.is (the website of the newspaper Morgunblaðið), 10% is rare triphones, 10% is street names, 10% is names of people, 10% is miscellaneous, 5% is names of countries and capitals, and 5% is URLs. The corpus contains 55,000 sentences. A list containing numbers, dates, times of day, names of days and months, simple questions, and common greetings was also included in the corpus.
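
As a rough illustration of the composition described above, the following sketch converts the stated shares into approximate sentence counts. It assumes the percentages apply to the 55,000 sentences as a whole and treats "about 50%" as exactly 50%, so the results are estimates only and not figures reported by the project.

```python
# Shares of the text corpus as stated above, in percent.
SHARES = {
    "news stories (mbl.is)": 50,
    "rare triphones": 10,
    "street names": 10,
    "names of people": 10,
    "miscellaneous": 10,
    "countries and capitals": 5,
    "URLs": 5,
}
TOTAL_SENTENCES = 55_000  # corpus size stated in the text


def approximate_counts(shares, total):
    """Estimate the number of sentences per category from percentage shares."""
    return {name: total * pct // 100 for name, pct in shares.items()}


counts = approximate_counts(SHARES, TOTAL_SENTENCES)

# The stated shares should cover the whole corpus.
assert sum(SHARES.values()) == 100

for name, n in counts.items():
    print(f"{name:24s} ~{n:,}")
```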

Headlines were extracted from the text obtained from mbl.is, and the text was then processed by the IceNLP sentence segmentizer to obtain a complete sentence list. The length of each sentence was limited to 6 words, both to make reading easier and to ensure that the sentence would fit on the screen of the Android G1 device. Each sentence was checked for spelling against the Database of Modern Icelandic Inflection (BÍN), and any sentence containing words not found in the dictionary was deleted from the final list. Sentences were then ordered randomly to ensure that the sample of sentences each participant read was representative of the text in the corpus.
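
The selection steps described above (length limit, dictionary check, random ordering) can be sketched roughly as follows. This is not the project's actual pipeline: it assumes over-long sentences are dropped rather than truncated, uses a simple regular expression instead of IceNLP for tokenization, and stands in a tiny hypothetical word set for BÍN. The function name `select_sentences` is also hypothetical.

```python
import random
import re

MAX_WORDS = 6  # length limit stated in the text


def select_sentences(sentences, lexicon, max_words=MAX_WORDS, seed=None):
    """Keep short sentences whose words are all in the lexicon, then shuffle."""
    kept = []
    for sentence in sentences:
        # Simplified tokenization: lowercase runs of word characters.
        words = re.findall(r"\w+", sentence.lower())
        if not words or len(words) > max_words:
            continue
        if all(word in lexicon for word in words):
            kept.append(sentence)
    # Random order, so any slice handed to a reader mixes all text types.
    random.Random(seed).shuffle(kept)
    return kept


# Hypothetical mini-lexicon standing in for the BÍN word forms.
lexicon = {"ég", "tala", "íslensku", "veðrið", "er", "gott", "í", "dag"}
candidates = [
    "Ég tala íslensku.",
    "Veðrið er gott í dag.",
    "Ég tala xyzzy.",                     # unknown word: dropped
    "Veðrið er gott í dag í dag í dag.",  # more than 6 words: dropped
]
selected = select_sentences(candidates, lexicon, seed=0)
print(selected)
```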

The data was recorded using Android G1 smartphones. Each participant was asked to read for 30 minutes or up to 250 utterances. The people donating their voices were unpaid participants and signed a special agreement covering the use of the voice samples in spoken language technologies operated by Google and in other spoken language tools. Google provided the 10 Android G1 smartphones used in the project.

The voice samples were collected in three phases. The first phase started on July 15, 2011. Ten volunteers received smartphones and were responsible for recruiting participants, i.e. asking them to donate a voice sample by reading sentences for 30 minutes. This phase ended in August, and the approach was not as effective as anticipated: it turned out to be hard to get people to volunteer, and the volunteers who did help out also had a hard time finding participants. The total number of participants in this phase was 59.

The second phase was carried out in September and October and was built around organized data collection events. A series of events was advertised within the universities (Reykjavík University and the University of Iceland), where two to three volunteers collected voice samples from participants, using all 10 phones. This phase lasted 4 weeks and was considerably more effective than the first, with 104 participants.

The last phase ran from November 2011 to January 2012 and was based on organized visits to companies and institutions. Preparation for this phase took some time, as key individuals in each workplace had to be identified, approached, and asked to organize the data collection. Each workplace received a set number of smartphones for a set number of days, after which the phones were sent on to the next workplace. Two to five volunteers were recruited, and the duration of the collection was deliberately kept short, usually three to four days. In total, 19 workplaces were visited and 430 people participated in this phase.

The total number of read sentences was thus 123,227 from 593 individuals.

Client software was installed on the smartphones to download the Icelandic utterances and upload the speech recordings. Google technical staff used the voice samples, together with other Icelandic language resources (large text corpora for building a language model), to develop a speech recognizer for Icelandic for Android smartphones and the Google search engine. These tools were announced in the fall of 2012.

Opening the data
It was decided to release the database of voice samples as open source, for use in developing speech recognizers and other speech technology tools. To make the voice samples as useful as possible, it was considered necessary to validate them. In the summer of 2014, a student at the University of Iceland listened to 69 thousand voice samples to determine whether the spoken text matched the text to be read. By the end of the summer, 57 thousand voice samples had been validated as good and were made available on this webpage. During the summer of 2015, another student listened to more voice samples, and during 2016 employees at the Árni Magnússon Institute for Icelandic Studies finished listening to the remaining voice samples.

In total, 127,286 voice samples were recorded. Of these, 5,401 were failed recordings, leaving 121,885 voice samples to be evaluated. Before the verification process started, new sound files were created by trimming long periods of silence at the beginning and end of the recordings. The total duration of the untrimmed files is about 152 hours; trimming reduced it to about 90 hours. During this process, 2,795 files were identified as silent, so 119,090 voice samples were evaluated in the first stage of the verification process. Of these, 100,020 recordings were accepted as correct and 19,070 were rejected. During the second stage, in the winter of 2016–2017, two evaluators listened to untrimmed versions of the 19,070 recordings rejected in stage one and classified them further. Of these samples, 8,548 were classified as correct. In total, 108,568 voice samples are considered good and are available through this webpage.
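
The counts in the paragraph above can be reconciled step by step. The short sketch below uses only the figures stated there; the variable names are illustrative.

```python
# Figures stated in the verification description.
recorded = 127_286
failed = 5_401
silent = 2_795
accepted_stage1 = 100_020
rejected_stage1 = 19_070
recovered_stage2 = 8_548

evaluated = recorded - failed              # samples left after failed recordings
evaluated_stage1 = evaluated - silent      # samples actually listened to in stage one
good = accepted_stage1 + recovered_stage2  # final set of good samples
discarded = recorded - good                # everything not in the final set

# Each derived number should match the figure given in the text.
assert evaluated == 121_885
assert evaluated_stage1 == 119_090
assert accepted_stage1 + rejected_stage1 == evaluated_stage1
assert good == 108_568
assert discarded == 18_718

print(f"good: {good:,} ({good / recorded:.1%} of all recordings)")
```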

Four evaluators each listened to the same 3,000 voice samples, selected randomly from all samples evaluated in the first stage. The results are in line with those obtained during the second stage of the verification process.

An Icelandic NGO, Almannarómur, was established on June 5, 2014. The aim of the NGO is to develop language technology tools for Icelandic. The database made available here has therefore been given the name Málrómur (“voice”).

For further information see the articles Almannarómur: An Open Icelandic Speech Corpus and Málrómur: A Manually Verified Corpus of Recorded Icelandic Speech (see References).



[1] When publishing results based on data in the Málrómur database, please refer to the two articles listed under References.


Contact
Steinþór Steingrímsson
Project manager
Stofnun Árna Magnússonar í íslenskum fræðum
Address: Laugavegur 13
Work: +354-525-4438
e-mail: steinthor.steingrimsson@arnastofnun.is


References
Jón Guðnason, Oddur Kjartansson, Jökull Jóhannsson, Elín Carstensdóttir, Hannes Högni Vilhjálmsson, Hrafn Loftsson, Sigrún Helgadóttir, Kristín M. Jóhannsdóttir and Eiríkur Rögnvaldsson. 2012. Almannarómur: An Open Icelandic Speech Corpus. Proceedings of SLTU ’12, 3rd Workshop on Spoken Languages Technologies for Under-Resourced Languages, Cape Town, South Africa.

Steinþór Steingrímsson, Jón Guðnason, Sigrún Helgadóttir and Eiríkur Rögnvaldsson. 2017. Málrómur: A Manually Verified Corpus of Recorded Icelandic Speech. Proceedings of the 21st Nordic Conference of Computational Linguistics (NODALIDA 2017). Gothenburg, Sweden, May 2017. Linköping University Press. Sweden.