MIM-GOLD



MIM-GOLD is a corpus containing one million words of text. The texts were tagged automatically and the tags were then manually corrected. The texts in the MIM-GOLD corpus were sampled from the texts of the MIM corpus, so the MIM license also applies to MIM-GOLD. The MIM-GOLD corpus is intended as a gold standard for training data-driven PoS taggers.


USE THE MIM-GOLD corpus¹

Here you can download the MIM-GOLD corpus. It is distributed under the MIM user license.

About the MIM-GOLD corpus
The MIM-GOLD corpus consists of 13 files of tagged Icelandic text, sampled from 13 of the 23 text domains of the 25 million word Tagged Icelandic Corpus (MIM). The texts were cleaned extensively and then run through an automatic tagging process consisting of five taggers and a voting scheme (Loftsson et al., 2010). The tags were then manually corrected. The files are delivered as is after the first manual correction phase, which has not been validated, and the tagging accuracy after this phase has not been estimated. The corpus is intended for the training of data-driven taggers for Icelandic.
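The idea of combining several taggers by voting can be illustrated with a short sketch. The actual voting scheme is the one described in Loftsson et al. (2010) and is not reproduced here; the code below assumes a plain majority vote, with ties going to the tagger listed first. The function names and the tags "A", "B", ... are hypothetical placeholders, not tags from the MIM tagset.

```python
from collections import Counter
from typing import List, Sequence


def vote(tags: Sequence[str]) -> str:
    """Return the tag proposed by most taggers for one token.

    Assumption: simple majority; on a tie, the tag from the tagger
    that comes first in the list wins.
    """
    if not tags:
        raise ValueError("no tags given")
    counts = Counter(tags)
    best = max(counts.values())
    for tag in tags:
        if counts[tag] == best:
            return tag
    raise AssertionError("unreachable")


def combine(outputs: Sequence[Sequence[str]]) -> List[str]:
    """Combine the outputs of several taggers token by token.

    `outputs` holds one tag sequence per tagger, all of the same length.
    """
    return [vote(column) for column in zip(*outputs)]


# Five hypothetical taggers, three tokens, placeholder tags.
outputs = [
    ["A", "B", "C"],
    ["A", "B", "D"],
    ["A", "X", "D"],
    ["Z", "B", "C"],
    ["A", "B", "E"],
]
combined = combine(outputs)
```

In this example the third token is a two-way tie between "C" and "D", which the sketch resolves in favour of the first tagger. Tokens like this, where the taggers disagree, are the ones most likely to need manual correction.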

The tagset is the same as the one used for the Tagged Icelandic Corpus. The text files are in Linux format and encoded in UTF-8. The files contain one token per line: each line consists of the word, followed by a tab and then the tag. Sentences are separated by empty lines.
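The file format described above can be read with a few lines of Python. This is a minimal sketch rather than an official reader: the function names are hypothetical, and the tags in the sample ("TAG_A" and so on) are placeholders, not real tags from the tagset.

```python
from typing import Iterable, Iterator, List, Tuple

Token = Tuple[str, str]  # (word, tag)


def read_sentences(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield sentences as lists of (word, tag) pairs.

    Each non-empty line is split at the first tab into word and tag;
    an empty line ends the current sentence.
    """
    sentence: List[Token] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            if sentence:
                yield sentence
                sentence = []
            continue
        word, _, tag = line.partition("\t")
        sentence.append((word, tag))
    if sentence:  # last sentence when the file has no trailing empty line
        yield sentence


def read_file(path: str) -> List[List[Token]]:
    """Read one corpus file, decoding it as UTF-8."""
    with open(path, encoding="utf-8") as handle:
        return list(read_sentences(handle))


# Small in-memory sample in the same layout (placeholder tags).
sample = "Hún\tTAG_A\nles\tTAG_B\n.\tTAG_C\n\nJá\tTAG_D\n.\tTAG_C\n"
sentences = list(read_sentences(sample.splitlines()))
```

Passing the name of one of the 13 downloaded files to `read_file` would return that file as a list of sentences, which is the shape most tagger training code expects.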

Here is a list of files in the corpus. The corpus is distributed with the MIM user license since the texts in MIM-GOLD were sampled from the MIM corpus.

As of December 2013, work is in progress to validate the tags in MIM-GOLD and to estimate the tagging accuracy.



¹ When publishing results based on the texts in MIM-GOLD please refer to:

Hrafn Loftsson, Jökull H. Yngvason, Sigrún Helgadóttir and Eiríkur Rögnvaldsson. 2010. Developing a PoS-tagged corpus using existing tools. Sarasola, Kepa, Francis M. Tyers and Mikel L. Forcada (eds.): 7th SaLTMiL Workshop on Creation and Use of Basic Lexical Resources for Less-Resourced Languages, LREC 2010, pp. 53-60. Valletta, Malta.



Contact
Sigrún Helgadóttir
Project manager
Árni Magnússon Institute for Icelandic Studies
Office: Neshaga 16
Work phone: +354-525-4434
Fax: +354-562-7242
E-mail: sigruhel@hi.is
Web page: https://notendur.hi.is/sigruhel/


References
Hrafn Loftsson, Jökull H. Yngvason, Sigrún Helgadóttir and Eiríkur Rögnvaldsson. 2010. Developing a PoS-tagged corpus using existing tools. Sarasola, Kepa, Francis M. Tyers and Mikel L. Forcada (eds.): 7th SaLTMiL Workshop on Creation and Use of Basic Lexical Resources for Less-Resourced Languages, LREC 2010, pp. 53-60. Valletta, Malta.