The Icelandic Gigaword Corpus

The Icelandic Gigaword Corpus (IGC) consists of about 1300 million running words of text. It is a tagged corpus, which means that each running word is accompanied by a morphosyntactic tag and a lemma, and each text is accompanied by bibliographic information. The corpus is intended for linguistic research and for use in Language Technology projects.


About IGC
What is a tagged corpus?

A tagged corpus is a collection of electronic texts in a standard format. The texts are analyzed in various ways to make them suitable for linguistic research and Language Technology projects. Each running word in the text is followed by a tag which shows its part of speech and often also morphosyntactic features, such as case, number and gender for nominals and person, number and tense for verbs. Each running word is also accompanied by a lemma, i.e. its base form: the nominative singular for nominals and the infinitive for verbs. Each text is also accompanied by metadata (bibliographic information for published texts).
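The description above can be pictured as one small record per running word. The following is a minimal sketch, not the format used by IGC itself: the class name Token, its field names, and the example tag strings (written in the style of the IFD tagset) are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """One running word with its annotation (hypothetical structure)."""
    form: str   # the word as it appears in the text
    lemma: str  # base form, e.g. nominative singular or infinitive
    tag: str    # morphosyntactic tag (illustrative values below)


# Illustrative sentence: "Hestarnir hlaupa" ("The horses run").
# The tag strings are examples only and have not been checked against IGC data.
sentence = [
    Token(form="Hestarnir", lemma="hestur", tag="nkfng"),
    Token(form="hlaupa", lemma="hlaupa", tag="sfg3fn"),
]

# Collect the lemma of every running word in the sentence.
lemmas = [token.lemma for token in sentence]
```

Keeping the word form, lemma and tag together in one record makes it straightforward to search or count by any of the three.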

The compilation of the Icelandic Gigaword Corpus
The Icelandic Gigaword Corpus consists of about 1300 million running words of text. Part of the corpus consists of official texts (e.g. parliamentary speeches going back to 1907, legal texts and court rulings). The corpus also contains large text collections from news media and various texts from the text collection of the Árni Magnússon Institute for Icelandic Studies. The Gigaword Corpus is a tagged corpus as described above. It was compiled during the years 2015 to 2017 at the Árni Magnússon Institute for Icelandic Studies. Only texts available in digital form were collected.

To enable the use of the corpus in Language Technology projects, it was considered important to secure copyright clearance for the texts. The original idea was to obtain permission from copyright owners to give access to the texts under Creative Commons licenses. Not all copyright holders could agree to those terms. The corpus is therefore divided into two parts, IGC1 and IGC2. IGC1 contains texts that can be used under a special license developed for the Tagged Icelandic Corpus (MIM). IGC2 contains official texts and texts that can be used under a CC BY license. All copyright holders have agreed that their material may be used free of licensing charges. Copyright owners who did not accept the CC BY license signed a special declaration developed for the Tagged Icelandic Corpus, with the necessary amendments for IGC1.

To give an overview of where the texts in the corpus originate, they have been classified into 6 categories. The largest portion of the text in IGC comes from web media, just over 38%. Printed newspapers account for just under 30% of the corpus, radio and television for just over 4%, and official texts for 26%. Less than 1% comes from the text collection of the Árni Magnússon Institute for Icelandic Studies. Other texts (about 0.7%) come from the University of Iceland Science Web and the Icelandic part of Wikipedia. The share given for each text category is based on the number of running words in the texts. Texts made available with a special license based on the license of the Tagged Icelandic Corpus (IGC1) make up just under 57% of the corpus, and the remainder are texts made available with the CC BY license (IGC2). Just over 86% of the texts are from the period after the year 2000 and just over 94% are from the period after 1980. The oldest texts are parliamentary speeches from 1907, together with a few texts from newspapers and magazines published before 1900.

There are 4,154,058 files with running words in IGC.

Many kinds of useful information can be extracted from the corpus, such as the frequency of word classes, words and word forms, as well as information on phrases, syntax and semantics. Such data are useful for dictionary compilation, for building spell checkers and grammar checkers, translation software, and tools for speech recognition and speech synthesis, and for developing tools for people who are blind, hard of hearing, motor-impaired or dyslexic.
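As an illustration of the frequency information mentioned above, the following minimal sketch counts word forms, lemmas and word classes in a list of (form, lemma, tag) triples. The function name frequency_tables and the sample tokens are hypothetical, and taking the first character of the tag as the word class is a simplifying assumption about the tagset.

```python
from collections import Counter


def frequency_tables(tokens):
    """Count word forms, lemmas and word classes in (form, lemma, tag) triples."""
    forms = Counter(form.lower() for form, _, _ in tokens)
    lemmas = Counter(lemma for _, lemma, _ in tokens)
    # Assumption: the first character of a tag identifies the word class.
    word_classes = Counter(tag[0] for _, _, tag in tokens if tag)
    return forms, lemmas, word_classes


# Illustrative tokens; the tag strings are examples only.
tokens = [
    ("Hestarnir", "hestur", "nkfng"),
    ("hlaupa", "hlaupa", "sfg3fn"),
    ("og", "og", "c"),
    ("hesturinn", "hestur", "nkeng"),
    ("hleypur", "hlaupa", "sfg3en"),
]

forms, lemmas, word_classes = frequency_tables(tokens)
most_common_lemmas = lemmas.most_common(2)
```

Counting by lemma groups the inflected forms of a word together, which is usually what a dictionary project needs, while counting by word form is closer to what a spell checker needs.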

Tagging the corpus
The corpus was tagged automatically. The texts in IGC were divided into sentences and running words and then tagged and lemmatized. IceNLP was used to divide the text into sentences and running words. Tagging was performed with IceStagger (see note 2) and lemmatization with the lemmatizer Nefnir (see note 3). Tags and lemmas have not been manually corrected.
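The order of the processing steps described above can be sketched as follows. This is only an outline of the pipeline shape: it does not call IceNLP, IceStagger or Nefnir, whose interfaces are not described here. The naive regular-expression splitter and tokenizer, the function names, and the tagger and lemmatizer passed in as plain callables are all stand-ins introduced for illustration.

```python
import re


def split_sentences(text):
    """Naive stand-in: split after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence):
    """Naive stand-in: words, or single punctuation characters."""
    return re.findall(r"\w+|[^\w\s]", sentence)


def annotate(text, tag_sentence, lemmatize):
    """Split into sentences and running words, then tag, then lemmatize.

    tag_sentence(words) must return one tag per word;
    lemmatize(word, tag) must return the lemma for a tagged word.
    """
    annotated = []
    for sentence in split_sentences(text):
        words = tokenize(sentence)
        tags = tag_sentence(words)
        annotated.append(
            [(word, lemmatize(word, tag), tag) for word, tag in zip(words, tags)]
        )
    return annotated


# Dummy components so the sketch runs; a real tagger and lemmatizer go here.
def dummy_tagger(words):
    return ["x"] * len(words)


def dummy_lemmatizer(word, tag):
    return word.lower()


result = annotate("Hann fór. Hún kom!", dummy_tagger, dummy_lemmatizer)
```

The lemmatizer receives the tag as well as the word form because, in a pipeline of this kind, lemmatization follows tagging and can make use of its output.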

The tagset used for tagging IGC was developed for the Icelandic Frequency Dictionary (IFD) and is used here with a few changes: proper nouns are not subclassified as person names, place names and other names, as was done in the IFD; the tag v is used for URLs and e-mail addresses; abbreviations are not divided into individual words and are tagged with the tag as; all number constants are tagged with the tag ta. A corpus made by concatenating the IFD corpus and the MIM-GOLD corpus was used to train IceStagger. The dictionaries used when tagging were augmented with the dictionary of the Database of Modern Icelandic Inflection (BÍN).
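The three special tags named above (v, as and ta) can be illustrated with a small rule-based sketch. This is not how IceStagger assigns them; it only shows which kinds of tokens each tag covers. The regular expressions, the short abbreviation list and the function name special_tag are assumptions made for this example.

```python
import re

# Hypothetical patterns, much simpler than what a real tokenizer would use.
URL_RE = re.compile(r"(https?://|www\.)\S+")
EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")
NUMBER_RE = re.compile(r"\d+([.,]\d+)*")

# A tiny illustrative list of Icelandic abbreviations kept as single tokens.
ABBREVIATIONS = {"t.d.", "o.s.frv.", "þ.e."}


def special_tag(token):
    """Return 'as', 'v' or 'ta' for the special token types, otherwise None."""
    if token in ABBREVIATIONS:
        return "as"  # abbreviation, not divided into individual words
    if URL_RE.fullmatch(token) or EMAIL_RE.fullmatch(token):
        return "v"   # URL or e-mail address
    if NUMBER_RE.fullmatch(token):
        return "ta"  # number constant
    return None      # ordinary word: left to the tagger


examples = ["https://example.is", "jon@example.is", "1.300", "t.d.", "hestur"]
assigned = [special_tag(token) for token in examples]
```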

Tagset for the Icelandic Gigaword corpus.

How will the corpus be used?
The corpus will be available in two ways. Firstly, the corpus will be searchable, and the tags (linguistic annotation) can be used to make a search more precise. The search returns a KWIC (keyword in context) index and information about the source of each text example. The search interface is based on the Swedish search interface Korp.
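The idea of a KWIC index restricted by linguistic annotation can be sketched in a few lines. The code below is not the Korp interface or its query language; it is a self-contained illustration over (form, lemma, tag) triples, and the function name kwic, its parameters and the sample tokens are hypothetical.

```python
def kwic(tokens, lemma=None, tag_prefix=None, width=3):
    """Return (left context, keyword, right context) for every matching token.

    tokens is a list of (form, lemma, tag) triples. A token matches when its
    lemma equals `lemma` (if given) and its tag starts with `tag_prefix`
    (if given). `width` is the number of context words on each side.
    """
    lines = []
    for i, (form, token_lemma, tag) in enumerate(tokens):
        if lemma is not None and token_lemma != lemma:
            continue
        if tag_prefix is not None and not tag.startswith(tag_prefix):
            continue
        left = " ".join(f for f, _, _ in tokens[max(0, i - width):i])
        right = " ".join(f for f, _, _ in tokens[i + 1:i + 1 + width])
        lines.append((left, form, right))
    return lines


# Illustrative tokens; the tag strings are examples only.
tokens = [
    ("Hestarnir", "hestur", "nkfng"),
    ("hlaupa", "hlaupa", "sfg3fn"),
    ("og", "og", "c"),
    ("hesturinn", "hestur", "nkeng"),
    ("hleypur", "hlaupa", "sfg3en"),
]

# Every occurrence of the lemma "hestur", in any inflected form.
by_lemma = kwic(tokens, lemma="hestur", width=2)

# Only forms of "hlaupa" whose tag starts with "sfg3e".
by_lemma_and_tag = kwic(tokens, lemma="hlaupa", tag_prefix="sfg3e", width=1)
```

Searching by lemma finds all inflected forms of a word at once, and adding a tag condition narrows the hits to a particular grammatical form.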

Secondly, the texts of the corpus can be used in Language Technology projects. Prospective users accept a special user license for IGC1 and a CC BY license for IGC2. The texts are in an XML format, TEI P5, which is defined by the TEI (Text Encoding Initiative). All texts are accompanied by metadata (bibliographic information for published works). All users are registered with their e-mail address when they accept the user license.
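Reading annotated text from a TEI-style XML file might look like the sketch below, which uses only the Python standard library. The exact element and attribute layout of the IGC files is not described here, so the sample document is an assumption: it supposes sentences in s elements, words in w elements with lemma and type attributes, and punctuation in c elements. The sample is deliberately minimal and is not a schema-valid TEI document.

```python
import xml.etree.ElementTree as ET

# The TEI namespace; the prefix "tei" is a local choice for the queries below.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical, simplified sample; real corpus files may be structured differently.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc><titleStmt><title>Dæmi</title></titleStmt></fileDesc>
  </teiHeader>
  <text><body><p>
    <s><w lemma="hestur" type="nkeng">Hesturinn</w><w lemma="hlaupa" type="sfg3en">hleypur</w><c type="punctuation">.</c></s>
  </p></body></text>
</TEI>"""


def read_tei(xml_text):
    """Return the title and a list of sentences as (form, lemma, tag) triples."""
    root = ET.fromstring(xml_text)
    title = root.findtext(".//tei:title", default="", namespaces=TEI_NS)
    sentences = []
    for s in root.iterfind(".//tei:s", TEI_NS):
        words = [
            (w.text, w.get("lemma"), w.get("type"))
            for w in s.iterfind("tei:w", TEI_NS)
        ]
        sentences.append(words)
    return title, sentences


title, sentences = read_tei(SAMPLE)
```

For files of realistic size, ET.parse on a file path or ET.iterparse would be more appropriate than loading the whole document from a string.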

Cooperation and financing
The corpus was compiled during the years 2015 to 2017 at the Árni Magnússon Institute for Icelandic Studies and was funded mostly by the Infrastructure Fund (no. 151110-0031, project manager Eiríkur Rögnvaldsson), the Contribution grants fund (Mótframlagssjóður) at the University of Iceland and the Ministry of Education and Culture. The company Creditinfo gave assistance in retrieving texts from radio and television and from some web media and printed media.

Project group
Eiríkur Rögnvaldsson
Sigrún Helgadóttir
Steinþór Steingrímsson

Software development
Gunnar Thor Örnólfsson
Kristján Rúnarsson
Starkaður Barkarson

1. When publishing results based on the texts in the Icelandic Gigaword Corpus, please refer to: The Icelandic Gigaword Corpus. Project management: Eiríkur Rögnvaldsson, Sigrún Helgadóttir and Steinþór Steingrímsson. The Árni Magnússon Institute for Icelandic Studies. Downloaded [DATE] from [URL]. The same applies to the release of any language technology tools that have used IGC.

2. Hrafn Loftsson and Robert Östling. 2013. Tagging a Morphologically Complex Language Using an Averaged Perceptron Tagger: The Case of Icelandic. In Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA-2013), NEALT Proceedings Series 16. Oslo, Norway.

3. Nefnir is a new lemmatizer developed by Jón Friðrik Daðason. The program has not been described.