Tagged Icelandic Corpus



The Tagged Icelandic Corpus (MÍM) is a morphosyntactically tagged corpus of Icelandic consisting of about 25 million tokens of contemporary Icelandic texts collected from varied sources during the years 2006-2010. The corpus is intended for use in Language Technology projects and for linguistic research. The corpus is available for search through a web interface and for download in TEI-conformant XML format. Each text in the corpus is accompanied by metadata.

USING MÍM [1]

• Search MÍM.
• Download MÍM (requires accepting a special user license).


About MÍM
What is a tagged corpus?
A tagged corpus is a collection of electronic texts in a standard format. The texts are analysed in various ways to make them suitable for linguistic research and Language Technology projects. Each running word in the text is followed by a tag which shows its part of speech and often also morphosyntactic features, such as case, number and gender for nominals, and person, number and tense for verbs. Each running word is also accompanied by a lemma, e.g. the nominative singular for nominals and the infinitive for verbs. Each text is also accompanied by metadata (bibliographic information for published texts).

The compilation of the Icelandic Corpus
The project to compile a tagged corpus containing contemporary Icelandic texts was started in 2004 at the Institute of Lexicography. The project was continued in 2006 at the newly founded Árni Magnússon Institute for Icelandic Studies (AMI) when the Institute of Lexicography became a part of that institute. The corpus was to contain about 25 million running words of text from different genres of Icelandic written in the twenty-first century. One of the main criteria for the compilation of the corpus was that it should contain a “balanced” or “representative” text collection. The texts were written during the years 2000-2010 and are, with one exception, original writings in Icelandic by native speakers of Icelandic. Only texts that were available electronically were collected.

To enable the use of the corpus in Language Technology projects it was considered important to secure copyright clearance for the texts to be used. All owners of copyrighted text signed a special declaration and agreed that their material may be used free of licensing charges.

It was anticipated that most of the texts would be protected by copyright (the final figure is about 86%). Early on in the project, cooperation was secured from the Writers' Union of Iceland, the Association of Non-fiction and Educational Writers in Iceland and the Icelandic Publishers' Association. All these associations recommended that their members cooperate with the project. The most important of these recommendations, and the most difficult to secure, was that of the publishers' association, since publishers are normally the keepers of digital copies of published material. Once permission had been obtained from the author of a published book, the publisher was contacted to obtain an electronic copy of the text. Both informative and imaginative writings were collected. Texts from published books make up just under 24% of the texts in the corpus.

The second largest portion of text, about 22%, is taken from newspapers, mostly from printed newspapers (less than 1% comes from two online newspapers). The printed newspapers are Morgunblaðið (20%) and Fréttablaðið (2%). Text from various printed periodicals makes up about 9.5% of the corpus. About 14% of the texts in the corpus are official texts and therefore not covered by copyright. These are speeches from the Icelandic Parliament, Alþingi (about 2% of the corpus texts), legal texts and adjudications (5.2%), and texts from the websites of government ministries (6.8%). All these texts, apart from the parliamentary speeches, which were obtained from the database of Alþingi, were harvested directly from the respective websites. A list of text categories in the corpus and a list of all texts in the corpus are also available.

Copyright owners were given a copy of the user license that users have to agree to in order to be able to download the corpus texts.

Many kinds of useful information can be extracted from the corpus, such as information on the frequency of word classes, words and word forms, and on phrases, syntax and semantics. Such data are useful for dictionary compilation, for making spell checkers and grammar checkers, translation software, and tools for speech recognition and speech synthesis, and for making assistive tools for people who are blind, hard of hearing or motor-impaired, and for people with dyslexia.

Cooperation and financing
During the first years the project was financed by the Language Technology project of the Ministry of Education, Science and Culture. The research project Variation in Syntax supplied the spoken component of the corpus. The project was partly financed by the research project Viable Language Technology Beyond English, which received a Grant of Excellence from the Icelandic Research Fund during the years 2009-2011. From February 2011 to January 2013 the project was financed by the Icelandic part of the META-NORD project, which is a cooperation between the Nordic and Baltic countries and is a part of META-NET. Special parts of the project have been financed by grants from the University Research Fund and the Icelandic Student Innovation Fund. The Árni Magnússon Institute for Icelandic Studies is a partner of the Icelandic Centre for Language Technology (ICLT). Researchers affiliated with the ICLT have also taken part in the compilation of the corpus.

Tagging the corpus
The corpus was tagged by automatic means. The software used, CorpusTagger, was developed for the work on the MIM-GOLD corpus (Hrafn Loftsson et al., 2010). The text was segmented into sentences and tokenized with the IceNLP software. The text was tagged with four taggers: fnTBL; MXPOST (Ratnaparkhi, 1996); TriTagger, which is a part of the IceNLP software and is a re-implementation of the well-known Hidden Markov Model (HMM) tagger TnT (Brants, 2000); and IceTagger (Hrafn Loftsson, 2008), which is a rule-based tagger and also a part of the IceNLP software. The taggers fnTBL, MXPOST and TriTagger are all data-driven taggers that were trained on the IFD corpus. The IFD corpus was also used for the development of the rule-based tagger IceTagger. Finally, the software CombiTagger was used to vote between the tags proposed by the four taggers. The MÍM corpus is thus tagged with the tagset of the IFD corpus, with the exception that proper names are not classified as personal names, place names and other proper names. The text was lemmatized with the tool Lemmald (Anton Ingason et al., 2008), which is also a part of the IceNLP software. The accuracy of the automatic morphosyntactic tagging has been estimated as 88.1-95.1% depending on text type (Hrafn Loftsson et al., 2010), and the lemmatization accuracy is estimated as approximately 90%.
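The voting step can be illustrated with a minimal sketch of simple majority voting. This is not CombiTagger's actual code: the function name, the tie-breaking rule (fall back on one designated tagger) and the tag strings are all assumptions made for illustration.

```python
from collections import Counter


def vote(proposals, tie_break):
    """Return the tag proposed by most taggers.

    proposals: dict mapping tagger name -> proposed tag for one token.
    tie_break: name of the tagger to trust when several tags share the
               top count (an assumed rule, chosen only for this sketch).
    """
    counts = Counter(proposals.values())
    best = max(counts.values())
    winners = {tag for tag, n in counts.items() if n == best}
    if len(winners) == 1:
        return winners.pop()
    # Tie: prefer the designated tagger's tag if it is among the winners,
    # otherwise pick deterministically so the result is reproducible.
    preferred = proposals[tie_break]
    return preferred if preferred in winners else sorted(winners)[0]


# Illustrative proposals for a single token (tag strings are placeholders).
proposals = {
    "fnTBL": "nkeo",
    "MXPOST": "nkeo",
    "TriTagger": "nken",
    "IceTagger": "nkeo",
}
print(vote(proposals, tie_break="IceTagger"))
```

With four taggers a 2-2 split is possible, which is why the sketch needs an explicit tie-breaking rule; a weighted scheme would be another option.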

The MÍM tagset.

Word frequency
As already mentioned, the text was lemmatized with the tool Lemmald, and the lemmatization accuracy was estimated to be approximately 90%. Reliable figures for the frequency of lemmas would require considerably higher lemmatization accuracy. However, to give some idea of lemma frequency, the frequency of lemmas that occur more than 100 times is shown in an Excel file. The Excel file contains 14 sheets. The first sheet (freq) contains lemmas that occur more than 100 times, sorted by frequency. The word class (pos) is specified, i.e. the first character of the tag. These letters are used: a: adverbs; c: conjunctions; e: foreign words; f: pronouns; g: article; l: adjectives; n: nouns; s: verbs; t: numerals; x: unspecified. It should be pointed out that prepositions are classified as adverbs. In the next sheet (alphabetic) lemmas are in alphabetical order. In the following sheet (freq(alphab)) lemmas are ordered by frequency, and lemmas with the same frequency are ordered alphabetically. In the next sheet (pos(freq(alphb))) lemmas are ordered by pos, then by frequency and finally alphabetically. The remaining sheets contain the lemmas for each word class, ordered by frequency and then alphabetically.
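The sort orders described above can be sketched in a few lines of Python. This is only an illustration, not the procedure used to produce the Excel file: the function names, the sample tokens and their tags are hypothetical, and Python's default string ordering is by code point, which differs from Icelandic alphabetical order.

```python
from collections import Counter

# Word-class letters as listed in the text (first character of the tag).
POS_NAMES = {
    "a": "adverbs", "c": "conjunctions", "e": "foreign words",
    "f": "pronouns", "g": "article", "l": "adjectives", "n": "nouns",
    "s": "verbs", "t": "numerals", "x": "unspecified",
}


def lemma_frequencies(tokens, threshold=100):
    """Count (lemma, pos) pairs that occur more than `threshold` times.

    tokens: iterable of (lemma, tag) pairs; pos is the tag's first character.
    """
    counts = Counter((lemma, tag[0]) for lemma, tag in tokens)
    return {key: n for key, n in counts.items() if n > threshold}


def by_frequency(freqs):
    """Most frequent first; equal frequencies ordered by lemma."""
    return sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0][0]))


def by_pos_then_frequency(freqs):
    """Ordered by pos, then by descending frequency, then by lemma."""
    return sorted(freqs.items(), key=lambda kv: (kv[0][1], -kv[1], kv[0][0]))


# Hypothetical sample; a threshold of 1 keeps the example small.
tokens = [
    ("vera", "sfg3en"), ("hestur", "nken"), ("vera", "sfg3en"),
    ("og", "c"), ("hestur", "nkeo"), ("vera", "sng"), ("og", "c"),
]
freqs = lemma_frequencies(tokens, threshold=1)
for (lemma, pos), n in by_pos_then_frequency(freqs):
    print(POS_NAMES[pos], lemma, n)
```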

How will the corpus be used?
The corpus will be available in two ways. Firstly, the corpus will be available for search, where the tags (linguistic annotation) can be used to define the search more accurately. The search interface is based on the Norwegian interface Glossa, which in turn uses the search engine Corpus Workbench. Corpus Workbench is a collection of open-source tools for managing and querying large text corpora (ranging from 10 million to 2 billion words) with linguistic annotations. The interface developed for MÍM has also been used for other corpora, such as the IFD Corpus and the Saga Corpus.
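To show how tags can narrow a search, the sketch below builds a query string in the CQP query language of Corpus Workbench, in which each token is written as a bracketed set of attribute constraints. The helper functions are hypothetical, and the attribute names lemma and pos are assumptions; the names actually defined for MÍM may differ. The patterns are inserted as given, without escaping.

```python
def cqp_token(**constraints):
    """Build one CQP token expression from attribute="regex" constraints."""
    parts = [f'{attr}="{pattern}"' for attr, pattern in constraints.items()]
    return "[" + " & ".join(parts) + "]"


def cqp_query(*tokens):
    """Join token expressions into a query for a sequence of adjacent tokens."""
    return " ".join(tokens) + ";"


# An adjective (tag starting with "l") followed by a form of the noun "hestur".
query = cqp_query(
    cqp_token(pos="l.*"),
    cqp_token(lemma="hestur", pos="n.*"),
)
print(query)  # [pos="l.*"] [lemma="hestur" & pos="n.*"];
```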

Secondly, the texts of the corpus can be downloaded for use in Language Technology projects. Prospective users must accept a special user license, and all users are registered when they accept it. The texts are in an XML format, TEI P5, which is defined by the TEI (Text Encoding Initiative). All texts are accompanied by metadata (bibliographic information for published works).
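A downloaded text could be read along the following lines. This is a sketch under stated assumptions: the sample document is invented, and it assumes that each word is a w element carrying its lemma in a lemma attribute and its tag in a type attribute. The actual element and attribute layout of the MÍM files should be checked against the downloaded data. Only the TEI namespace URI is taken from the TEI P5 standard.

```python
import xml.etree.ElementTree as ET

# Namespace used by TEI P5 documents.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented sample; the w/@lemma and w/@type layout is an assumption.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Sample text</title></titleStmt>
      <publicationStmt><p>Hypothetical sample</p></publicationStmt>
      <sourceDesc><p>Hypothetical sample</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text><body><p><s>
    <w lemma="hestur" type="nken">Hestur</w>
    <w lemma="hlaupa" type="sfg3en">hleypur</w>
    <pc>.</pc>
  </s></p></body></text>
</TEI>"""


def read_tokens(xml_text):
    """Return the title from the header and a list of (word, lemma, tag)."""
    root = ET.fromstring(xml_text)
    title = root.findtext(".//tei:titleStmt/tei:title", namespaces=TEI_NS)
    tokens = [
        (w.text, w.get("lemma"), w.get("type"))
        for w in root.iterfind(".//tei:w", TEI_NS)
    ]
    return title, tokens


title, tokens = read_tokens(SAMPLE)
print(title)
for word, lemma, tag in tokens:
    print(word, lemma, tag)
```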

Derived projects
The most important project derived from MÍM is the gold standard for POS tagging (MIM-GOLD), a corpus of about one million tokens of text selected from the MÍM texts, in which the tags have been manually corrected. The MIM-GOLD corpus will be used for training data-driven part-of-speech taggers.

Project Manager
Sigrún Helgadóttir

Project group
Ásta Svavarsdóttir
Eiríkur Rögnvaldsson
Kristín Bjarnadóttir

Other co-workers
Auður Þórunn Rögnvaldsdóttir (preparatory stage)
Eyrún Ellý Valsdóttir (text collection and text cleaning)
Hjördís Stefánsdóttir (text collection and text cleaning)
Guðmundur Örn Leifsson (search interface)
Kristján Friðbjörn Sigurðsson (manually checking tags in MIM-GOLD)
Jökull Huxley Yngvason (CorpusTagger)
Kristín Margrét Jóhannsdóttir (metadata and text cleaning)
Steinþór Steingrímsson (import to TEI format, search interface)

More about the corpus
More information about the compilation of the corpus will appear on this webpage during the next weeks and months.



[1] When publishing results based on the texts in the Tagged Icelandic Corpus please refer to:

Mörkuð íslensk málheild. [The Tagged Icelandic Corpus.] (n.d.) Sigrún Helgadóttir, editor. The Árni Magnússon Institute for Icelandic Studies. Accessed [DATE] from malfong.is.


Contact
Sigrún Helgadóttir
Project Manager
The Árni Magnússon Institute for Icelandic Studies
Office: Neshagi 16
Work phone: +354-525-4434
Fax: +354-562-7242
E-mail: sigruhel@hi.is
Web page: https://notendur.hi.is/sigruhel/


References
Anton K. Ingason, Sigrún Helgadóttir, Hrafn Loftsson and Eiríkur Rögnvaldsson. 2008. A Mixed Method Lemmatization Algorithm Using Hierarchy of Linguistic Identities (HOLI). In B. Nordström and A. Ranta (eds.), Advances in Natural Language Processing, 6th International Conference on NLP, GoTAL 2008, Proceedings. Gothenburg, Sweden.

Brants, Thorsten. 2000. TnT - A Statistical Part-of-Speech Tagger. Proceedings of the Sixth Applied Natural Language Processing Conference ANLP-2000, pp. 224–231. Seattle, Washington, USA.

Hrafn Loftsson. 2008. Tagging Icelandic text: A linguistic rule-based approach. Published in revised form by Cambridge University Press in Nordic Journal of Linguistics 31(1), 47-72. © 2008 Cambridge University Press.

Hrafn Loftsson, Jökull H. Yngvason, Sigrún Helgadóttir and Eiríkur Rögnvaldsson. 2010. Developing a PoS-tagged corpus using existing tools. In Proceedings of "Creation and use of basic lexical resources for less-resourced languages", workshop at the 7th International Conference on Language Resources and Evaluation, LREC 2010. Valletta, Malta.

Ratnaparkhi, A. 1996. A Maximum Entropy Model for Part-of-Speech Tagging. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP-96), pp. 133–143. Philadelphia, PA.

Further reading
Friðrik Magnússon. 1988. Hvað er títt? Tíðnikönnun Orðabókar Háskólans. Orð og tunga 1:1-49.

Hrafn Loftsson. 2006. Tagging Icelandic text: An experiment with integrations and combinations of taggers. Language Resources and Evaluation 40(2), 175-181.

Hrafn Loftsson. 2006. Tagging a morphologically complex language using heuristics. In T. Salakoski, F. Ginter, S. Pyysalo and T. Pahikkala (eds.), Advances in Natural Language Processing, 5th International Conference on NLP, FinTAL 2006, Proceedings. Turku, Finland.

Hrafn Loftsson. 2007. Tagging Icelandic Text using a Linguistic and a Statistical Tagger. In Proceedings of Human Language Technologies 2007: The Conference of the North American Chapter of the ACL. Rochester, NY, USA.

Hrafn Loftsson. 2009. Correcting a POS-Tagged Corpus Using Three Complementary Methods. In Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009). Athens, Greece.

Hrafn Loftsson and Eiríkur Rögnvaldsson. 2007. IceNLP: A Natural Language Processing Toolkit for Icelandic. In Proceedings of InterSpeech 2007, Special session: "Speech and language technology for less-resourced languages". Antwerp, Belgium.

Hrafn Loftsson, Ida Kramarczyk, Sigrún Helgadóttir and Eiríkur Rögnvaldsson. 2009. Improving the PoS tagging accuracy of Icelandic text. In Proceedings of the 17th Nordic Conference of Computational Linguistics (NODALIDA-2009). Odense, Denmark.

Hrafn Loftsson, Sigrún Helgadóttir and Eiríkur Rögnvaldsson. 2011. Using a morphological database to increase the accuracy in PoS tagging. In Proceedings of Recent Advances in Natural Language Processing (RANLP 2011). Hissar, Bulgaria.

Jörgen Pind (ed.), Friðrik Magnússon and Stefán Briem. 1991. Íslensk orðtíðnibók. Orðabók Háskólans, Reykjavík.

Sigrún Helgadóttir. 2007. Mörkun íslensks texta. Orð og tunga 9:75-107. Reykjavík.

Sigrún Helgadóttir, Ásta Svavarsdóttir, Eiríkur Rögnvaldsson, Kristín Bjarnadóttir and Hrafn Loftsson. 2012. The Tagged Icelandic Corpus (MÍM). Proceedings of the Workshop on Language Technology for Normalisation of Less-Resourced Languages - SaLTMiL 8 - AfLaT2012, pp. 67-72. Istanbul, Turkey.