The Saga Corpus



This page provides access to electronic versions of several Old Icelandic narrative texts: the Family Sagas (Íslendingasögur), Sturlunga Saga, the Sagas of the Kings of Norway (Heimskringla) and the Book of Settlement (Landnámabók). With the exception of Landnámabók, the texts are taken from editions published by Svart á hvítu and Mál og menning between 1985 and 1991.

The texts have been normalized to Modern Icelandic spelling, and several inflectional endings have been changed to their Modern Icelandic form. The texts can be searched, and they can also be downloaded for use in linguistic research and language technology (LT) projects.





USING THE SAGA CORPUS [1]

• Search the Saga Corpus on the corpus page of the Árni Magnússon Institute.
• Download here. The corpus is distributed under a CC BY 4.0 license.

About the Saga Corpus
The Texts

The corpus made available here contains 41 texts from the Family Sagas, Sturlunga Saga, Heimskringla and the Book of Settlement. The table below shows the size of each part of the corpus. Numbers refer to running words, excluding punctuation.


Text                    Words
Family Sagas          982,066
Sturlunga Saga        260,586
Heimskringla          231,502
Book of Settlement     37,120
Total               1,511,275


The texts of the Family Sagas and of Sturlunga Saga are taken from the editions published by Svart á hvítu (Family Sagas: Bragi Halldórsson, Jón Torfason, Sverrir Tómasson and Örnólfur Thorsson (eds.), 1985-1986; Sturlunga Saga: Örnólfur Thorsson, Bergljót Kristjánsdóttir, Bragi Halldórsson, Gísli Sigurðsson, Guðrún Ása Grímsdóttir, Guðrún Ingólfsdóttir, Jón Torfason and Sverrir Tómasson (eds.), 1988). The text of Heimskringla is from the edition published by Mál og menning (Bergljót Kristjánsdóttir, Bragi Halldórsson, Jón Torfason and Örnólfur Thorsson (eds.), 1991). The spelling was normalized to Modern Icelandic spelling and some inflectional endings were changed to Modern Icelandic form. The text of the Book of Settlement is from the edition by Jakob Benediktsson (Jakob Benediktsson, 1968). The book was scanned and the text normalized to Modern Icelandic spelling in the same way as the other texts. A list of the texts can be found here. One of the texts is Íslendingaþættir, a collection of short tales called þættir.

What was changed?

Normalization to modern spelling includes the following changes: the number of vowel symbols is reduced ('æ' is used for both 'æ' and 'œ', and 'ö' is used for both 'ø' and 'ǫ'); the letter u is inserted between a consonant and r at the end of a word (maðr > maður); ss and rr at the end of a word are shortened (íss > ís, herr > her); and t and k at the end of a word in unstressed syllables are changed to ð and g (þat > það, ok > og). Furthermore, some inflectional endings were changed to Modern Icelandic form.
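The spelling rules listed above can be sketched in a few lines of Python. This is only an illustration of the rules as stated, not the procedure actually used to prepare the corpus. The function names are hypothetical, the changes to inflectional endings are left out, and the rule for unstressed syllables is replaced by a small lookup table (an assumption, since stress cannot be read off the spelling) that contains only the two examples given in the text.

```python
import re

# Vowel mergers named in the text: œ > æ, ø > ö, ǫ > ö.
VOWEL_MAP = str.maketrans({"œ": "æ", "ø": "ö", "ǫ": "ö"})

# Assumption: a lookup table stands in for the "unstressed syllable" rule.
# It holds only the two examples from the text.
UNSTRESSED_FINAL = {"þat": "það", "ok": "og"}

# A consonant followed by word-final r. The letter r itself is left out of
# the class so that a shortened "herr" is not touched again.
CONSONANT_BEFORE_R = re.compile(r"([bdðfghjklmnpstvxzþ])r$")


def normalize_word(word: str) -> str:
    """Apply the listed spelling rules to a single lowercase word."""
    word = word.translate(VOWEL_MAP)
    if word in UNSTRESSED_FINAL:
        return UNSTRESSED_FINAL[word]
    # Shorten word-final ss and rr (íss > ís, herr > her).
    word = re.sub(r"(ss|rr)$", lambda m: m.group(1)[0], word)
    # Insert u between a consonant and word-final r (maðr > maður).
    return CONSONANT_BEFORE_R.sub(r"\1ur", word)


def normalize_text(text: str) -> str:
    """Normalize every word in a lowercase text, leaving punctuation alone."""
    return re.sub(r"\w+", lambda m: normalize_word(m.group()), text)
```

For example, normalize_text("maðr ok herr") returns "maður og her". A real normalizer would also need to handle capitalized words and the exceptions that any such rule set has.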

Tagging experiments

Three attempts were made at tagging the texts. The first experiments were performed in 2005. The texts were first tagged with a method developed for Modern Icelandic, where tagging methods have been built on the tagged texts of the Icelandic Frequency Dictionary (IFD). The data-driven tagger TnT (Brants, 2000) was trained on the tagged texts of the Icelandic Frequency Dictionary (Sigrún Helgadóttir, 2004, 2007), and the resulting model was used to tag all the texts in the Saga Corpus. To measure tagging accuracy, four randomly selected samples of 1000 words each were used: one from the Family Sagas, one from Heimskringla and two from Sturlunga Saga. The tags in these samples were corrected manually. Counting the correct tags in these samples gave a tagging accuracy of 88%, compared with 90.4% for the texts of the Icelandic Frequency Dictionary. The structure of sentences in Old Icelandic is quite different from that in Modern Icelandic. Different word order should particularly affect the accuracy of a statistical tagger such as TnT, which is based on trigrams. However, sentences in Old Icelandic texts are generally very short, and short sentences are easier to analyze than long ones.
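The accuracy measurement described above amounts to comparing the tagger's output with the manually corrected tags, token by token. The following is a minimal sketch of that calculation. The function names are hypothetical, the tags in the usage example are placeholders rather than tags from the actual tagset, and pooling the samples into a single count is an assumption about how the reported figure was obtained.

```python
def tagging_accuracy(predicted, gold):
    """Share of tokens whose predicted tag equals the corrected (gold) tag."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must cover the same tokens")
    if not gold:
        raise ValueError("cannot measure accuracy on an empty sample")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


def pooled_accuracy(samples):
    """Accuracy over several samples, counting all tokens together.

    `samples` is a list of (predicted_tags, gold_tags) pairs, one pair
    per sample.
    """
    correct = 0
    total = 0
    for predicted, gold in samples:
        if len(predicted) != len(gold):
            raise ValueError("predicted and gold must cover the same tokens")
        correct += sum(p == g for p, g in zip(predicted, gold))
        total += len(gold)
    if total == 0:
        raise ValueError("cannot measure accuracy on empty samples")
    return correct / total
```

With four samples of 1000 words each, an accuracy of 88% corresponds to 3,520 correct tags out of 4,000.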

Then seven whole texts (sagas) and two fragments from the Sturlunga collection, 95,000 words in total, were selected for manual correction. The TnT tagger was trained on these texts and the new model was used to tag the whole corpus. Accuracy was again measured on the four samples, giving 91.7%. Finally, the TnT tagger was trained on the union of the corrected Old Icelandic texts and the Modern Icelandic texts, and the Old Icelandic corpus was tagged using this model. Accuracy was measured in the same way as before and reached 92.7% (Eiríkur Rögnvaldsson and Sigrún Helgadóttir, 2011).

In 2013 Hrafn Loftsson and Robert Östling experimented again with the tagging of the Old Icelandic texts (Hrafn Loftsson and Robert Östling, 2013). They corrected the training corpus from the Sturlunga collection partly automatically and partly manually, changing 2,144 tags in total. They tested three taggers, of which Stagger (Östling, 2012) performed best. The authors tagged the corrected training corpus (which they call SAGA-GOLD) using 10-fold cross-validation, adding the IFD corpus (the tagged texts of the Icelandic Frequency Dictionary) to each training fold. This gave a mean accuracy of 91.76%. The authors also combined the output of three taggers (TriTagger, HMM+Ice+HMM (Hrafn Loftsson et al., 2009) and Stagger) and obtained 92.32% accuracy.
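The cross-validation setup described above, in which an additional corpus is added to every training fold but never to the test fold, can be sketched as follows. This is a generic illustration and not the authors' code: the function name is hypothetical, and assigning sentences to folds in an interleaved fashion is an assumption, since the text does not say how the folds were drawn.

```python
def folds_with_extra(sentences, extra_train, k=10):
    """Yield (train, test) pairs for k-fold cross-validation.

    `sentences` is the corpus being evaluated (here: the corrected Old
    Icelandic training corpus). `extra_train` is added to every training
    fold (here: the Modern Icelandic corpus) and is never tested on.
    """
    if k < 2 or k > len(sentences):
        raise ValueError("k must be between 2 and the number of sentences")
    for i in range(k):
        # Every k-th sentence, starting at position i, forms the test fold.
        test = sentences[i::k]
        train = [s for j, s in enumerate(sentences) if j % k != i]
        yield train + list(extra_train), test
```

A tagger would be trained on each `train` list and evaluated on the matching `test` list, and the mean of the k accuracy figures would be reported.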

In January 2018 Starkaður Barkarson retagged all the texts of the Saga Corpus using Stagger. A new training corpus was made by concatenating the IFD corpus (about 500 thousand running words), the texts from the Sturlunga collection (SAGA-GOLD, about 95 thousand running words) with the corrections made by Hrafn Loftsson, and the new gold standard for Icelandic, MIM-GOLD (about 1 million running words). Tagging accuracy was estimated with the same method as in the first experiment, i.e. by tagging the three 1000 word samples and comparing the tags to the corrected tags. The accuracy was estimated to be 93.5%. This number is not comparable to the result obtained by Hrafn Loftsson and Robert Östling, since they did not use the 1000 word samples but ran a ten-fold cross-validation instead.

Tagging and lemmatizing the texts

The texts were tokenized and split into sentences using the IceNLP suite. They were then tagged with the tagger Stagger, as described above, and lemmatized with the lemmatizer Nefnir. Nefnir is a new lemmatizer by Jón Friðrik Daðason. It has not yet been described in print, but it gives better results than the previously used lemmatizer, Lemmald (Anton Ingason et al., 2008).

After tagging was completed, the tags in those parts of the Sturlunga texts that also belong to the training corpus were restored to their manually corrected values.
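The restoration step can be pictured as overwriting the tagger's output with the corrected tags wherever a sentence also occurs in the training corpus. The sketch below assumes, purely for illustration, that sentences are stored under identifiers shared by the tagged corpus and the corrected corpus. The function name and data layout are hypothetical and say nothing about how the step was actually implemented.

```python
def restore_gold(tagged, gold):
    """Return a copy of `tagged` with gold sentences substituted in.

    Both arguments map a sentence identifier to a list of (token, tag)
    pairs. Sentences without a gold version keep the tagger's output.
    """
    restored = {}
    for sent_id, sentence in tagged.items():
        gold_sentence = gold.get(sent_id)
        if gold_sentence is None:
            restored[sent_id] = list(sentence)
            continue
        # Guard against misaligned data: the tokens must be identical.
        if [tok for tok, _ in sentence] != [tok for tok, _ in gold_sentence]:
            raise ValueError(f"token mismatch in sentence {sent_id!r}")
        restored[sent_id] = list(gold_sentence)
    return restored
```

Checking that the tokens match before substituting guards against silently attaching corrected tags to the wrong words.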

Texts available before March 2018

Note that the texts available through this website until March 2018 were tagged with the same procedures as the texts in the Tagged Icelandic Corpus, and flaws were found in some of them. The texts now available have been tagged with the best available method, using a training corpus that includes medieval texts as described above, and the flaws have been corrected.


Using the Saga Corpus
The texts of the Saga Corpus can be used in two ways:

1. Search. The search is available on the corpus page of the Árni Magnússon Institute. Grammatical information can be used to refine the search, and bibliographic information is displayed for the texts that appear in the search results. Here is a list of the texts that can be searched. On the search page, any of the texts can be selected for searching. One of the works is Íslendingaþættir, and any of the individual þættir can also be selected.

2. Download. The texts are available in an XML format defined by the TEI (Text Encoding Initiative). Bibliographic information is included with all the texts. Prospective users must register and accept the terms and conditions. Download here. The texts are distributed under a CC BY 4.0 license.
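As a rough illustration of working with the downloaded files, the sketch below reads tokens, lemmas and tags from a TEI document with Python's standard library. The element and attribute names are assumptions: TEI provides s (sentence) and w (word) elements, but whether the Saga Corpus files use them, and which attributes carry the lemma and the tag, must be checked against the files themselves. The sample document and its tag values are invented for the example.

```python
import xml.etree.ElementTree as ET

# The TEI P5 namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented sample; the real files may be structured differently.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body><p>
    <s><w lemma="maður" type="TAG1">maður</w><w lemma="og" type="TAG2">og</w></s>
  </p></body></text>
</TEI>"""


def read_sentences(xml_text):
    """Return a list of sentences, each a list of (token, lemma, tag)."""
    root = ET.fromstring(xml_text)
    sentences = []
    for s in root.iterfind(".//tei:s", TEI_NS):
        words = [
            # Assumed attribute names: "lemma" for the lemma, "type" for the tag.
            (w.text, w.get("lemma"), w.get("type"))
            for w in s.iterfind("tei:w", TEI_NS)
        ]
        sentences.append(words)
    return sentences
```

Calling read_sentences(SAMPLE) returns one sentence with two (token, lemma, tag) triples.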


[1] When publishing results based on the texts in the Saga Corpus, please refer to: Eiríkur Rögnvaldsson and Sigrún Helgadóttir. Morphosyntactic Tagging of Old Icelandic Texts and Its Use in Studying Syntactic Variation and Change. In Caroline Sporleder, Antal P.J. van den Bosch and Kalliopi A. Zervanou (eds.): Language Technology for Cultural Heritage: Selected Papers from the LaTeCH Workshop Series, pp. 63–76. Springer, Berlin. The corpus is also available for search at malheildir.arnarstofnun.is



References
Anton K. Ingason, Sigrún Helgadóttir, Hrafn Loftsson and Eiríkur Rögnvaldsson. 2008. A Mixed Method Lemmatization Algorithm Using a Hierarchy of Linguistic Identities (HOLI). In B. Nordström and A. Ranta (eds.), Advances in Natural Language Processing, 6th International Conference on NLP, GoTAL 2008, Proceedings. Gothenburg, Sweden.

Brants, Thorsten. 2000. TnT - A Statistical Part-of-Speech Tagger. In Proceedings of the Sixth Applied Natural Language Processing Conference ANLP-2000, pp. 224-231. Seattle, Washington, USA.

Bragi Halldórsson, Jón Torfason, Sverrir Tómasson and Örnólfur Thorsson (eds.). 1985-1986. Íslendinga sögur. Svart á hvítu. Reykjavík.

Bergljót Kristjánsdóttir, Bragi Halldórsson, Jón Torfason and Örnólfur Thorsson (eds.). 1991. Heimskringla. Mál og menning. Reykjavík.

Eiríkur Rögnvaldsson and Sigrún Helgadóttir. 2011. Morphosyntactic Tagging of Old Icelandic Texts and Its Use in Studying Syntactic Variation and Change. In Caroline Sporleder, Antal P.J. van den Bosch and Kalliopi A. Zervanou (eds.): Language Technology for Cultural Heritage: Selected Papers from the LaTeCH Workshop Series, pp. 63-76. Springer, Berlin.

Hrafn Loftsson, Ida Kramarczyk, Sigrún Helgadóttir and Eiríkur Rögnvaldsson. 2009. Improving the PoS tagging accuracy of Icelandic text. In Proceedings of the 17th Nordic Conference of Computational Linguistics (NODALIDA-2009). Odense, Denmark.

Hrafn Loftsson and Robert Östling. 2013. Tagging a Morphologically Complex Language Using an Averaged Perceptron Tagger: The Case of Icelandic. In Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA-2013), NEALT Proceedings Series 16. Oslo, Norway.

Jakob Benediktsson (ed.). 1968. Íslenzk fornrit I. Íslendingabók - Landnámabók. Hið íslenzka fornritafélag.

Sigrún Helgadóttir. 2004. Testing Data-Driven Learning Algorithms for PoS Tagging of Icelandic. In H. Holmboe (ed.): Nordisk Sprogteknologi. Museum Tusculanums Forlag.

Sigrún Helgadóttir. 2007. Mörkun íslensks texta. Orð og tunga 9:75-107. Reykjavík.

Örnólfur Thorsson, Bergljót Kristjánsdóttir, Bragi Halldórsson, Gísli Sigurðsson, Guðrún Ása Grímsdóttir, Guðrún Ingólfsdóttir, Jón Torfason and Sverrir Tómasson (eds.). 1988. Sturlunga saga. Svart á hvítu. Reykjavík.

Östling, Robert. 2012. Stagger: A modern POS tagger for Swedish. In Proceedings of the 4th Swedish Language Technology Conference, SLTC, Lund, Sweden.