Icelandic Frequency Dictionary



The Icelandic Frequency Dictionary (IFD) was published in 1991. It presents the results of extensive research on Modern Icelandic that examined the frequency of words and grammatical features in texts of various kinds. The Preface of the IFD gives a detailed account of the work as a whole, both the research and the book itself. The results are found in the fourteen main chapters of the book, where the frequencies of words and grammatical features are displayed in lists and tables. Jörgen Pind was the editor of the book, with collaborators Friðrik Magnússon and Stefán Briem. A special text corpus was created for the making of the book. This website contains a description of the corpus and makes it accessible online. The texts can be searched, and they can also be downloaded and used for linguistic research and LT projects.



USING IFD

• Search the IFD.
• Download the IFD here. Distributed with a proprietary license.
• Download training/testing sets from the IFD here. Distributed with a proprietary license.

About IFD
A special text corpus was created for the making of the Icelandic Frequency Dictionary (Pind, Magnússon and Briem 1991), published by The Institute of Lexicography in 1991. Preparations for that corpus started in 1985, and a detailed description of the work can be found in the preface to the book. There are fragments from 100 texts in the corpus, all published between 1980 and 1989. Each text contains about 5,000 running words. The texts were selected from five categories: Icelandic fiction (20 texts), translated fiction (20 texts), biographies and memoirs (20 texts), non-fiction (10 in the field of humanities, 10 in the field of science) and books for children and teenagers (10 original texts, 10 translations).

In the preface of the IFD, 'a running word' is defined as a continuous sequence of letters and/or numbers and symbols that is delimited by spaces or punctuation. The rule is to incorporate as long a string of characters as possible under each numeral. Plus signs, minus signs and percentage symbols therefore belong to a running word. Hybrids of numbers and other characters, such as chemical and mathematical formulae, are considered one running word. It should be noted that abbreviations are in most cases analyzed as they are read.
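The definition can be approximated with a regular expression. The following is only a rough sketch, not the segmentation procedure used for the IFD: the pattern is an assumption, it treats every punctuation mark as its own token, and it does not expand abbreviations as they are read.

```python
import re

# Assumed pattern (not taken from the IFD preface):
#  - an optional sign that does not directly follow a letter or digit,
#  - a run of letters/digits,
#  - optional groups such as ",5" or ".25" that continue a number,
#  - an optional trailing percent sign.
# Any other non-space character is returned as a separate token.
RUNNING_WORD = re.compile(r"(?:(?<!\w)[+-])?\w+(?:[.,]\d\w*)*%?|[^\w\s]")

def running_words(text):
    """Split text into running words and punctuation tokens."""
    return RUNNING_WORD.findall(text)

tokens = running_words("Hitinn var -5 stig, eða 12,5% minna.")
```

With this pattern, the sign and the percent symbol stay attached to their numbers, while the comma and the full stop are split off.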

The texts were divided into running words. The corpus contains 590,279 running words that appear in 59,358 different word forms, including punctuation. The running words are tagged with 639 different morphosyntactic tags, including punctuation. The most difficult part of automatic grammatical analysis is dealing with word forms that can have more than one analysis. In the text corpus of the IFD, 15.9% of the word forms are ambiguous with respect to the tags they receive within the IFD. The most ambiguous word form in the IFD is 'minni', which has 24 morphosyntactic tags in the corpus, although others are possible (ég minni þig á það 'I remind you of that'; ég geri þetta eftir minni 'I do this from memory'; Nonni er minni en Siggi 'Nonni is smaller than Siggi', etc.).
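A share of ambiguous word forms of this kind can be computed by collecting the set of tags observed with each word form. The sketch below assumes the corpus has already been read into (word form, tag) pairs; the function name and the sample data are illustrative, and the tag strings are placeholders rather than real IFD tags.

```python
from collections import defaultdict

def ambiguity_rate(tagged_tokens):
    """Return (share of word forms seen with more than one tag, tags per form)."""
    tags_by_form = defaultdict(set)
    for form, tag in tagged_tokens:
        tags_by_form[form].add(tag)
    if not tags_by_form:
        return 0.0, {}
    ambiguous = sum(1 for tags in tags_by_form.values() if len(tags) > 1)
    return ambiguous / len(tags_by_form), dict(tags_by_form)

# Illustrative data only; the tag strings are placeholders, not IFD tags.
sample = [
    ("minni", "TAG_VERB"),
    ("minni", "TAG_NOUN"),
    ("minni", "TAG_ADJ"),
    ("ég", "TAG_PRON"),
    ("þig", "TAG_PRON"),
    ("á", "TAG_ADV"),
]
rate, tags_by_form = ambiguity_rate(sample)
```

Here one of the four distinct word forms ('minni') occurs with more than one tag. Whether word forms are case-folded before counting is left to the caller.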

Each running word was then placed in a separate line. The morphosyntactic tags and the lemma were placed in the same line. The picture below shows one sentence from the novel "Mín káta angist" by Guðmundur Andri Thorson, and how it is analyzed. For greater clarity, an explanation is given of the morphosyntactic tag.
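A file in this one-token-per-line layout can be read with a few lines of Python. This is a minimal sketch that assumes three whitespace-separated fields per line in the order word form, tag, lemma; the actual field order and separator in the IFD files may differ, and the tags in the sample are placeholders.

```python
from typing import NamedTuple

class Token(NamedTuple):
    form: str
    tag: str
    lemma: str

def parse_line(line: str) -> Token:
    """Parse one 'form tag lemma' line (assumed whitespace-separated)."""
    fields = line.split()
    if len(fields) != 3:
        raise ValueError(f"expected 3 fields, got {len(fields)}: {line!r}")
    return Token(*fields)

def parse_text(text: str) -> list[Token]:
    """Parse every non-blank line of a one-token-per-line text."""
    return [parse_line(line) for line in text.splitlines() if line.strip()]

# Placeholder tags; the lemmas are the dictionary forms of the words.
sample = "Nonni tag1 Nonni\ner tag2 vera\nminni tag3 lítill\n"
tokens = parse_text(sample)
```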



The preface to the IFD presents the automatic analysis that was employed in the making of the book (Pind, Magnússon and Briem 1991). The automatic analysis is based on an analysis of 54,000 running words that had been analyzed manually in a pilot project (Magnússon 1988). Stefán Briem (1990) outlines the methods applied in the automatic analysis. The authors of the IFD believe that about 80% of the running words were correctly analyzed in the automatic analysis. A few years later the program was refined based on an analysis of the whole text. That gave almost 90% accuracy (Briem p.c.).

The analysis of the running words that was used in the IFD distinguishes between eight word classes: nouns, adjectives, pronouns, independent article, numerals, verbs, adverbs and conjunctions. Words that were not classified into these word classes were either considered foreign words or left unanalyzed. The main deviation from standard word class analysis is that prepositions were classed as adverbs. As a result, there are adverbs that govern case. Exclamations were also classed as adverbs, and the infinitive particle was classed as a conjunction.

The preface of the book contains a table that describes the morphosyntactic tags that were used in the making of the book. Each grammatical feature is represented by a letter. In the table the abbreviations are explained and are given in the order in which they appear in the frequency tables in the printed version. In the electronic version, however, the order of the letters in the morphosyntactic tags is as in the files that were used in the compilation of the frequency tables in the book. Only lower case letters are used for the grammatical features, and the order is different from that presented in the frequency tables. In the electronic texts of the IFD corpus there are 639 morphosyntactic tags, including punctuation. The tagset, however, allows around 700 possible morphosyntactic tags.

Copyright

All the texts used in the making of the IFD are protected by copyright. The material is made available in three different ways. First, it can be searched. However, only 500 letters can be seen at a time (5-6 lines), which falls within the limits of copyright law (No. 73/1972). Second, users can obtain a copy of the tagged texts of authors who have given permission for the material to be used for grammatical research and LT projects. No translated texts are included there. Third, the texts are made available for training statistical taggers, but in that case they have been broken up and rearranged so that they cannot be identified. See more details in Using the IFD.

Correction of tags

Although the tags in the texts of the IFD have been manually corrected, some errors inevitably remain. Hrafn Loftsson made an attempt to find and correct errors in the tags of the IFD automatically (Loftsson 2009). The texts available through this website have been corrected as described in that paper.

Read the Preface of the IFD.


Using IFD
The texts of the IFD are available for use in three different ways:

1. Search. The search is available through the website of the Tagged Icelandic Corpus and is comparable to the search in the corpus. Grammatical information can be used to refine the search. Bibliographic information is displayed for the texts that appear in the search results. Here is a list of the texts that can be searched.

2. Download. The texts are available in a special XML format that is defined by TEI (Text Encoding Initiative). Bibliographical information is included with all the texts. Prospective users must register and accept the terms and conditions of a user license. Download.

3. Train statistical taggers. A tagging method is commonly trained and tested on ten pairs of training and test sets. Each computer file of the IFD contains a text of about 5,000 words from one source. To create the pairs, each file was divided into ten roughly equal parts. Each of these ten parts forms one test set, and the corresponding training set contains the other nine parts. In each pair, the larger file is the training set and the smaller file is the test set. The test sets therefore do not overlap, whereas any two training sets share eight of the ten parts, about 80% of the corpus. The tagger is trained and tested on all ten pairs and the average accuracy is calculated. This is called ten-fold cross-validation. Prospective users must register and accept the terms and conditions of a user license. Download.
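The construction of the ten pairs can be sketched as follows. This is an illustration under assumptions, not the script used to produce the distributed sets: it assumes that the k-th part of every file goes into the k-th test set, it does not reproduce the rearrangement applied to the distributed texts, and the function names and the `train_and_score` callback are hypothetical.

```python
def split_into_parts(tokens, n_parts=10):
    """Split a token list into n_parts contiguous, roughly equal parts."""
    size, extra = divmod(len(tokens), n_parts)
    parts, start = [], 0
    for i in range(n_parts):
        end = start + size + (1 if i < extra else 0)
        parts.append(tokens[start:end])
        start = end
    return parts

def make_folds(files, n_parts=10):
    """Build (training set, test set) pairs from a list of token lists."""
    split_files = [split_into_parts(tokens, n_parts) for tokens in files]
    folds = []
    for k in range(n_parts):
        # Test set k: the k-th part of every file.
        test = [tok for parts in split_files for tok in parts[k]]
        # Training set k: the other nine parts of every file.
        train = [tok for parts in split_files
                 for j, part in enumerate(parts) if j != k
                 for tok in part]
        folds.append((train, test))
    return folds

def cross_validate(files, train_and_score, n_parts=10):
    """Average the accuracy returned by train_and_score over all folds."""
    folds = make_folds(files, n_parts)
    scores = [train_and_score(train, test) for train, test in folds]
    return sum(scores) / len(scores)

# Two toy "files" of 50 tokens each stand in for the 5,000-word files.
files = [[f"a{i}" for i in range(50)], [f"b{i}" for i in range(50)]]
folds = make_folds(files)
```

In this toy setup each training set holds 90 tokens and each test set 10, and any two training sets have 80 tokens in common, which matches the overlap described above.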


Contact
Sigrún Helgadóttir
Project manager
Árni Magnússon Institute for Icelandic Studies
Office: Neshaga 16
Work phone: +354-525-4434
Fax: +354-562-7242
E-mail: sigruhel@hi.is
Web page: https://notendur.hi.is/sigruhel/


References
Helgadóttir, Sigrún. 2007. Mörkun íslensks texta. Orð og tunga 9:75-107. Reykjavík.

Ingason, Anton K., Sigrún Helgadóttir, Hrafn Loftsson and Eiríkur Rögnvaldsson. 2008. A Mixed Method Lemmatization Algorithm Using a Hierarchy of Linguistic Identities (HOLI). In B. Nordström and A. Ranta (eds.), Advances in Natural Language Processing, 6th International Conference on NLP, GoTAL 2008, Proceedings. Gothenburg, Sweden.

Magnússon, Friðrik. 1988. Hvað er títt? Tíðnikönnun Orðabókar Háskólans. Orð og tunga 1:1-49.

Loftsson, Hrafn. 2006. Tagging Icelandic text: An experiment with integrations and combinations of taggers. Language Resources and Evaluation 40(2), 175-181.

Loftsson, Hrafn. 2006. Tagging a morphologically complex language using heuristics. In T. Salakoski, F. Ginter, S. Pyysalo and T. Pahikkala (eds.), Advances in Natural Language Processing, 5th International Conference on NLP, FinTAL 2006, Proceedings. Turku, Finland.

Loftsson, Hrafn. 2007. Tagging Icelandic Text using a Linguistic and a Statistical Tagger. In Proceedings of Human Language Technologies 2007: The Conference of the North American Chapter of the ACL. Rochester, NY, USA.

Loftsson, Hrafn. 2008. Tagging Icelandic text: A linguistic rule-based approach. Appeared in a revised form, subsequent to editorial input by Cambridge University Press, in Nordic Journal of Linguistics 31(1), 47-72. © 2008 Cambridge University Press.

Loftsson, Hrafn. 2009. Correcting a POS-Tagged Corpus Using Three Complementary Methods. In Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009). Athens, Greece.

Loftsson, Hrafn and Eiríkur Rögnvaldsson. 2007. IceNLP: A Natural Language Processing Toolkit for Icelandic. In Proceedings of InterSpeech 2007, Special session: "Speech and language technology for less-resourced languages". Antwerp, Belgium.

Loftsson, Hrafn, Ida Kramarczyk, Sigrún Helgadóttir and Eiríkur Rögnvaldsson. 2009. Improving the PoS tagging accuracy of Icelandic text. In Proceedings of the 17th Nordic Conference of Computational Linguistics (NODALIDA-2009). Odense, Denmark.

Loftsson, Hrafn, Jökull H. Yngvason, Sigrún Helgadóttir and Eiríkur Rögnvaldsson. 2010. Developing a PoS-tagged corpus using existing tools. In Proceedings of "Creation and use of basic lexical resources for less-resourced languages", workshop at the 7th International Conference on Language Resources and Evaluation, LREC 2010. Valletta, Malta.

Loftsson, Hrafn, Sigrún Helgadóttir and Eiríkur Rögnvaldsson. 2011. Using a morphological database to increase the accuracy in PoS tagging. In Proceedings of Recent Advances in Natural Language Processing (RANLP 2011). Hissar, Bulgaria.

Pind, Jörgen (ed.), Friðrik Magnússon and Stefán Briem. 1991. Íslensk orðtíðnibók. Orðabók Háskólans, Reykjavík.

Briem, Stefán. 1990. Automatisk morfologisk analyse af islandsk tekst. In Jörgen Pind and Eiríkur Rögnvaldsson (eds.), Papers from the Seventh Scandinavian Conference of Computational Linguistics, Reykjavík 1989:3–13. Institute of Lexicography, Institute of Linguistics, Reykjavík.