MIM-GOLD



MIM-GOLD is a corpus containing one million words of text. The texts were tagged automatically and the tags were then manually corrected. The texts in MIM-GOLD were sampled from the texts of the MIM corpus; MIM-GOLD is therefore distributed under a special license based on the MIM license. The MIM-GOLD corpus is intended as a gold standard for training data-driven PoS taggers.


USE THE MIM-GOLD corpus1

Here you can download the MIM-GOLD corpus, version 0.9. Distributed with a special user license.
Here you can download the MIM-GOLD corpus, version 1.0. Distributed with a special user license.
Here you can download training/testing sets for version 1.0 of MIM-GOLD. Distributed with a special user license.



THIS PAGE IS BEING REVISED

About the MIM-GOLD corpus
The Tagged Icelandic Corpus (MÍM) was published in 2013. The corpus contains about 25 million running words of texts written during the first decade of the 21st century.

While MÍM was being compiled, about one million tokens were sampled from 13 of the 23 domains of MÍM. The new corpus was intended to replace the corpus of the Icelandic Frequency Dictionary (IFD) as a gold standard for the training of data-driven taggers for Icelandic.

In 2013 version 0.9 of MIM-GOLD was published. Now (2018) version 1.0 is published. The development of MIM-GOLD is described below. The process is divided into 5 phases, numbered 0 to 4.

Phase 0

Work on MIM-GOLD commenced in the summer of 2009 when a grant was secured from the Student Innovation Fund2 to hire a student to start the project. The texts were sampled at the Árni Magnússon Institute for Icelandic Studies, and the student, under the supervision of Hrafn Loftsson at Reykjavík University, developed a system for tagging the texts. The texts were tokenized with a tokenizer that is part of the IceNLP system. The texts were then tagged with five taggers: fnTBL, MXPOST, IceTagger, Bidir and TnT (Hrafn Loftsson et al., 2010). The tool CombiTagger was then used to vote between the proposed tags, choosing for each token the tag suggested by the most taggers. The taggers were trained on the corpus of the Icelandic Frequency Dictionary (IFD), so the tagset of the IFD was used.
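The voting step can be sketched in a few lines; this is a minimal stand-in for CombiTagger's simple voting, and the tie-breaking behaviour (first-seen tag wins) is an assumption, not taken from the source:

```python
from collections import Counter

def majority_vote(tag_candidates):
    """Pick the tag proposed by the most taggers for a single token.

    `tag_candidates` is a list of tags, one per tagger. Ties go to the
    tag that appears first in the list (an assumption made here for
    illustration; CombiTagger's actual tie-breaking may differ).
    """
    counts = Counter(tag_candidates)
    return counts.most_common(1)[0][0]

# Five taggers propose tags for the same token; "nken" wins 3 to 2:
proposals = ["nken", "nken", "nkeo", "nken", "nkeo"]
print(majority_vote(proposals))  # -> nken
```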

During the winter of 2009-2010 a search for systematic errors in the MIM-GOLD corpus was performed. Noun phrase (NP), prepositional phrase (PP) and verb phrase (VP) error detection programs described by Loftsson (2009) were used. A large proportion of the errors detected were checked manually and errors corrected. Tagging accuracy was then estimated by inspecting every 100th word; a tag was counted as correct only if the whole tag string (consisting of up to 6 characters) was correct. Mean tagging accuracy was estimated as 92.3%, ranging between 87.6% and 95.5% depending on text domain (Hrafn Loftsson et al., 2010). This part of the project also received a contribution from a grant from the Icelandic Research Fund3.
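The estimation procedure amounts to exact-match accuracy over a systematic sample. As a sketch, with the manual inspection replaced by a comparison against a list of known-correct tags (hypothetical data, for illustration only):

```python
def estimate_accuracy(corpus_tags, correct_tags, step=100):
    """Estimate tagging accuracy from a systematic sample.

    Inspects every `step`-th token and counts a tag as correct only
    if the whole tag string matches, mirroring the procedure in the
    text. `correct_tags` stands in for the human inspector's verdict.
    """
    indices = range(0, len(corpus_tags), step)
    hits = sum(corpus_tags[i] == correct_tags[i] for i in indices)
    return hits / len(indices)

# Toy example with step=2: tokens 0 and 2 are sampled, one mismatches.
print(estimate_accuracy(["a", "b", "x", "d"], ["a", "b", "c", "d"], step=2))  # -> 0.5
```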

Phase 1

During the summer of 2010 another grant was secured from the Student Innovation Fund4 to employ a student to manually check and correct the tags of all the words in MIM-GOLD. The first job was to finish checking errors found during Phase 0 that had not yet been corrected (texts from Morgunblaðið). Work on checking texts from printed books was also started. The student was then hired part-time during term time, and during 2010-2011 all the words in MIM-GOLD were manually checked and corrected. Version 0.9 of MIM-GOLD, made available on this website in 2013, contains the files after this correction phase. Mean accuracy was estimated as before by inspecting the tag of every 100th word. Mean accuracy was estimated as 96.4%, ranging between 89.9% and 98.5% depending on text domain (Sigrún Helgadóttir et al., 2014). The project also received contributions from META-NORD5 and the Ministry of Education Science and Culture6.

Phase 2

The next correction phase started at the end of 2012. The corpus was first tagged automatically with the tagger IceTagger, which is part of the IceNLP software. A script was written that compares the tags output by IceTagger with the (presumed) correct tags in the corpus; wherever the two differed, the token was marked as an error candidate. A second student was employed during the summer of 2013, and part-time after that, to inspect the error candidates manually. For each error candidate, the student was instructed to i) select the tag in the corpus; or ii) select the tag proposed by IceTagger; or iii) select a new correct tag when neither IceTagger nor the corpus contained the correct tag. After about 80% of the texts had been checked and corrected, tagging accuracy was estimated as 99.6%, ranging between 99.5% and 100.0% depending on text domain (Sigrún Helgadóttir et al., 2014). One more student was employed in late 2013 to finish checking and correcting the tags; that work was finished in 2014. Tagging accuracy was not estimated at the end of this phase. This part of the project was supported in part by META-NORD5 and the Ministry of Education Science and Culture6.
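The comparison script can be sketched as follows; the function name and the (token, tag) data layout are illustrative assumptions, and token positions are assumed to align:

```python
def error_candidates(gold, tagger_output):
    """Compare a tagger's output with the (presumed) correct corpus tags.

    Both arguments are lists of (token, tag) pairs over the same token
    sequence. Every disagreement becomes an error candidate
    (position, token, corpus tag, tagger tag) for manual inspection.
    """
    candidates = []
    for i, ((token, gold_tag), (_, auto_tag)) in enumerate(zip(gold, tagger_output)):
        if gold_tag != auto_tag:
            candidates.append((i, token, gold_tag, auto_tag))
    return candidates

gold = [("Hann", "fpken"), ("hljóp", "sfg3eþ")]
auto = [("Hann", "fpken"), ("hljóp", "sfg3en")]
print(error_candidates(gold, auto))  # -> [(1, 'hljóp', 'sfg3eþ', 'sfg3en')]
```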

Phase 3

Hrafn Loftsson and Robert Östling experimented in 2013 with developing a tagger for Icelandic by training and testing the tagger Stagger (Robert Östling, 2012) on the IFD and obtained 92.82% accuracy (Hrafn Loftsson and Robert Östling, 2013). Since this was the best result obtained so far for tagging Icelandic text, it was decided to test Stagger on MIM-GOLD. In 2015 Steinþór Steingrímsson, Sigrún Helgadóttir and Eiríkur Rögnvaldsson experimented with training Stagger on the IFD and on MIM-GOLD (Steinþór Steingrímsson et al., 2015), using the version of MIM-GOLD produced at the completion of Phase 2. The experiment reported by Loftsson and Östling (2013) was repeated for MIM-GOLD, using linguistic features (LF) and the unknown word guesser IceMorphy (part of the IceNLP software), with an extended lexicon based on the Database of Icelandic Inflection (BÍN). Ten-fold cross-validation gave 92.76% accuracy for MIM-GOLD. Comparing the accuracy obtained when training and testing Stagger on the IFD and on MIM-GOLD made it clear that there were still a number of inconsistencies and incorrect tags in MIM-GOLD (Steinþór Steingrímsson et al., 2015). It was therefore decided to work further on reducing the number of errors and inconsistencies in MIM-GOLD: lists of inconsistencies and errors were compiled and students were employed to check them manually. The tagset was also modified slightly. Work on this phase was completed in 2017. This part of the project was funded by the Institute of Linguistics at the University of Iceland7 and the Icelandic Ministry of Education Science and Culture8.

Phase 4

Starkaður Barkarson obtained the MIM-GOLD data after Phase 3 was completed and trained Stagger on the texts (Starkaður Barkarson, 2017). Tagging accuracy had not been estimated after Phase 3 by inspecting a sample, as had been done after previous correction phases. Starkaður repeated the experiment performed by Steinþór Steingrímsson, Sigrún Helgadóttir and Eiríkur Rögnvaldsson in 2015: he performed a comparable ten-fold cross-validation on MIM-GOLD and obtained 92.74% accuracy.

Despite the corrections made to MIM-GOLD, tagging accuracy did not seem to increase. To make sure that the experiments were completely comparable, the experiment performed by Steinþór Steingrímsson and his colleagues (Steinþór Steingrímsson et al., 2015) was repeated as closely as possible: the same version of MIM-GOLD (before Phase 3) was used, with the same division into training and testing sets. The data from the Database of Icelandic Inflection (BÍN) were not completely comparable, since a later version was now used. Starkaður obtained 92.41% accuracy using BÍN and IceMorphy, compared with 92.76% in the experiment performed by Steinþór Steingrímsson and colleagues. Starkaður therefore concludes that the corrections made to MIM-GOLD resulted in an increase in accuracy of about 0.3 percentage points. He believes that the reason for the difference may lie in the set of words and word endings available to IceMorphy, since there is a large difference in accuracy on unknown words (just under 15%) but only a small difference on known words (0.09%) (Starkaður Barkarson, 2017).

Modified tagset

To simplify grammatical analysis and reduce inconsistencies in tagging, the tagset of the IFD was slightly modified during the correction phases of MIM-GOLD. The following changes were made:
  • Foreign names were originally tagged as proper nouns. During Phase 3 they were retagged as foreign words (e). (Steinþór Steingrímsson et al., 2015).
  • In the IFD, function words preceding complement clauses were classified as adverbs (aa). From Phase 2 onwards, on the other hand, they are classified as prepositions when followed by a complement clause. Thus, the word til in the sentence „Hann hljóp til að komast fyrr heim“ ("He ran in order to get home sooner") is classified as a preposition governing the genitive case (ae). (Sigrún Helgadóttir et al., 2014; Steinþór Steingrímsson et al., 2015; Starkaður Barkarson, 2017).
  • Further subclassification of proper nouns was abandoned during Phase 3. Tags of all proper nouns now end in -s, instead of -m (person names), -ö (place names) and -s (other proper nouns). This reduces the number of tags by 68. (Steinþór Steingrímsson et al., 2015).
  • During Phase 3 v was adopted as a tag for e-mail addresses and web addresses. (Steinþór Steingrímsson et al., 2015).
  • During Phase 3 as was adopted as a tag for abbreviations. In the IFD tagset, abbreviations were broken up into individual words and each letter tagged as the word it stood for. (Steinþór Steingrímsson et al., 2015).
  • During Phase 3 it was decided that all number constants that had been tagged as cardinals (tf...) should be given the tag ta and not analyzed further for gender, number and case, as is done when numbers are written out in alphabetic characters. (Steinþór Steingrímsson et al., 2015).
In his Master's dissertation Starkaður Barkarson (2017) discusses the effect of analysing foreign names with the tag e and the need to simplify the analysis of punctuation signs.
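Two of the changes above are pure tag rewrites and could be applied mechanically to an older version of the corpus. A partial, illustrative sketch (the function name is hypothetical, the digit check is a simplification, and only these two changes are covered; retagging foreign names or abbreviations needs more than the tag itself):

```python
def modernize_tag(token, tag):
    """Map an IFD-style tag to the modified MIM-GOLD tagset.

    Covers two of the changes described in the text:
    - number constants written with digits lose their gender/number/case
      analysis: cardinal tags (tf...) become plain 'ta';
    - proper-noun subclasses (-m person, -ö place, -s other) collapse
      to a single -s ending.
    """
    if token.isdigit() and tag.startswith("tf"):
        return "ta"                       # e.g. "1984"/tfkfn -> ta
    if tag.startswith("n") and "-" in tag:
        return tag.split("-")[0] + "-s"   # e.g. nven-ö -> nven-s
    return tag

print(modernize_tag("1984", "tfkfn"))       # -> ta
print(modernize_tag("Reykjavík", "nven-ö")) # -> nven-s
```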



Version 0.9

Version 0.9 of MIM-GOLD was released in 2013 with 13 files, after the corrections performed during Phase 1. Mean accuracy was estimated as 96.4%, ranging between 89.9% and 98.5% depending on text domain. The text files use Linux line endings and are encoded in UTF-8. The format of the files is one token per line: each line consists of the word, followed by a tab and then the tag. Sentences are separated by empty lines.
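The file format described above can be parsed in a few lines; the function name is illustrative:

```python
def read_gold(lines):
    """Parse MIM-GOLD-formatted lines: one token per line as
    'word<TAB>tag', with sentences separated by empty lines.
    Returns a list of sentences, each a list of (word, tag) pairs."""
    sentences, current = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line:                     # blank line ends a sentence
            if current:
                sentences.append(current)
                current = []
        else:
            word, tag = line.split("\t")
            current.append((word, tag))
    if current:                          # flush a trailing sentence
        sentences.append(current)
    return sentences

# Typically called on an open file:
#   with open("gold_file.txt", encoding="utf-8") as f:
#       sentences = read_gold(f)
print(read_gold(["Hann\tfpken\n", "hljóp\tsfg3eþ\n", "\n", "Heim\taa\n"]))
```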

Here is a list of files in version 0.9 of MIM-GOLD.

Version 1.0

Version 1.0 of MIM-GOLD, released now in 2018, contains 13 files with the tag corrections performed up to 2017 and with the modified tagset described above. The texts are comparable to the texts in version 0.9, apart from corrections of tokenization and of tags. The format of the files is one token per line: each line consists of the word, followed by a tab and then the tag. Sentences are separated by empty lines.

Here is a list of files in version 1.0 of MIM-GOLD.

The corpus is distributed with a special user license which is based on the MIM license since the texts in MIM-GOLD were sampled from the MIM corpus.



1When publishing results based on the texts in MIM-GOLD please refer to:

Hrafn Loftsson, Jökull H. Yngvason, Sigrún Helgadóttir and Eiríkur Rögnvaldsson. 2010. Developing a PoS-tagged corpus using existing tools. Sarasola, Kepa, Francis M. Tyers and Mikel L. Forcada (eds.): 7th SaLTMiL Workshop on Creation and Use of Basic Lexical Resources for Less-Resourced Languages, LREC 2010, pages 53-60. Valletta, Malta.

Further information about the project can also be found in (Sigrún Helgadóttir et al., 2014) and (Steinþór Steingrímsson et al., 2015).



Cooperation and grants

The project was a joint project between The Árni Magnússon Institute for Icelandic Studies, Reykjavík University and the University of Iceland. Here is a list of other grants.

2 The Icelandic Student Innovation Fund, Mörkun og leiðrétting nýrrar málheildar (Tagging and correcting a new corpus), April 2009. Main applicant Hrafn Loftsson.
3 Icelandic Research Fund (RANNÍS), grant 090662011, Viable Language Technology beyond English – Icelandic as a test case. 2009. Main applicant Eiríkur Rögnvaldsson.
4 The Icelandic Student Innovation Fund, Íslensk staðalmálheild (An Icelandic standard corpus), 2010. Main applicant Eiríkur Rögnvaldsson.
5 Icelandic part of the META-NORD project, supported by the EU ICT Policy Support Programme as part of the Competitiveness and Innovation Framework Programme, grant agreement no 270899. February 1 2011 to January 31 2013. Project Manager Eiríkur Rögnvaldsson.
6 Icelandic Ministry of Education Science and Culture, as part of the Icelandic Government's IT Policy Programme.
7 Institute of Linguistics at the University of Iceland.
8 Icelandic Ministry of Education Science and Culture. Making of a plan for Language technology. Language Technology for Icelandic 2018-2022. June 2017.



The people behind the project

Hrafn Loftsson
Eiríkur Rögnvaldsson
Sigrún Helgadóttir
Jökull H. Yngvason
Kristján Friðbjörn Sigurðsson
Steinunn Valbjörnsdóttir
Brynhildur Stefánsdóttir
Jón Friðrik Daðason
Starkaður Barkarson



Using MIM-GOLD
The texts of MIM-GOLD are accessible in three different ways:

1. Here you can download the MIM-GOLD corpus, version 0.9. Distributed with a special user license.

2. Here you can download the MIM-GOLD corpus, version 1.0. Distributed with a special user license.

3. Here you can download training/testing sets for version 1.0 of MIM-GOLD. Distributed with a special user license. In order to train and test a particular tagging method, a procedure is often used that is based on ten pairs of training and test sets. Each training set contains about 90% of each of the 13 files of MIM-GOLD; the remaining 10% form the corresponding test set. The test sets therefore do not overlap, while any two training sets have about 80% of the corpus in common. The tagger is trained and tested on all ten pairs and average accuracy is calculated. This is called ten-fold cross-validation.
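The construction of the ten pairs can be sketched as follows. The round-robin assignment of sentences to folds is an illustrative simplification: the actual MIM-GOLD sets split each of the 13 files into contiguous 90%/10% parts.

```python
def ten_fold_splits(sentences, k=10):
    """Build k (train, test) pairs from a list of sentences.

    Fold i takes every k-th sentence starting at i as its test set and
    the rest as training data. The test sets are disjoint and together
    cover the whole corpus; any two training sets overlap heavily.
    """
    pairs = []
    for i in range(k):
        test = [s for j, s in enumerate(sentences) if j % k == i]
        train = [s for j, s in enumerate(sentences) if j % k != i]
        pairs.append((train, test))
    return pairs

# With 20 "sentences", each fold trains on 18 and tests on 2:
pairs = ten_fold_splits(list(range(20)))
print(len(pairs), len(pairs[0][0]), len(pairs[0][1]))  # -> 10 18 2
```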

Contact
E-mail: malfong[at]malfong.is


References
Hrafn Loftsson. 2009. Correcting a POS-Tagged Corpus Using Three Complementary Methods. In Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009), pages 523-531. Athens, Greece.

Hrafn Loftsson, Jökull H. Yngvason, Sigrún Helgadóttir and Eiríkur Rögnvaldsson. 2010. Developing a PoS-tagged corpus using existing tools. Sarasola, Kepa, Francis M. Tyers and Mikel L. Forcada (eds.): 7th SaLTMiL Workshop on Creation and Use of Basic Lexical Resources for Less-Resourced Languages, LREC 2010, pages 53-60. Valletta, Malta.

Hrafn Loftsson and Robert Östling. 2013. Tagging a morphologically complex language using an averaged perceptron tagger: The case of Icelandic. In Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013), NEALT Proceedings Series 16. Oslo, Norway.

Sigrún Helgadóttir, Hrafn Loftsson and Eiríkur Rögnvaldsson. 2014. Correcting Errors in a New Gold Standard for Tagging Icelandic Text. Proceedings of LREC 2014. pages 2944-2948. Reykjavík.

Starkaður Barkarson. 2017. Þjálfun málfræðimarkarans Stagger með nýjum gullstaðli (Training the PoS tagger Stagger on a new gold standard). MA thesis, University of Iceland, School of Humanities.

Steinþór Steingrímsson, Sigrún Helgadóttir and Eiríkur Rögnvaldsson. 2015. Analysing Inconsistencies and Errors in PoS Tagging in two Icelandic Gold Standards. In Proceedings of the 20th Nordic Conference of Computational Linguistics (NODALIDA 2015), Vilnius, Lithuania, May 2015. Linköping University Press, Sweden.

Robert Östling. 2012. Stagger: A modern POS tagger for Swedish. In Proceedings of the Swedish Language Technology Conference (SLTC). Lund, Sweden.