The Saga Corpus



This page gives access to electronic texts of some of the Old Icelandic sagas: the Family Sagas (Íslendingasögur), Sturlunga Saga, the Sagas of the Kings of Norway (Heimskringla) and the Book of Settlement (Landnámabók). With the exception of Landnámabók, the texts are taken from the editions published by Svart á hvítu and Mál og menning between 1985 and 1991.

The texts have been normalized to Modern Icelandic spelling. Several inflectional endings were also changed to their Modern Icelandic form. The texts can be searched, and they can also be downloaded for use in linguistic research and language technology (LT) projects.




USING THE SAGA CORPUS

• Search the Saga Corpus.
• Download here. Distributed with a CC BY 3.0 license.

About the Saga Corpus
The Texts

The corpus made available here contains 41 texts from the Family Sagas, Sturlunga Saga, Heimskringla and the Book of Settlement. The division of the corpus is shown in the table below.

The texts of the Family Sagas and Sturlunga Saga are taken from the editions published by Svart á hvítu between 1985 and 1988. The text of Heimskringla is from the edition published by Mál og menning in 1991. The spelling was normalized to Modern Icelandic and some inflectional endings were changed to their Modern Icelandic form. The text of the Book of Settlement is from Jakob Benediktsson's edition of 1968. The book was scanned and the text normalized to Modern Icelandic spelling in the same way as the other texts. A list of the texts can be found here.

What was changed?

Transliteration to modern spelling includes the following changes: the number of vowel symbols is reduced ('æ' is used for both 'æ' and 'œ', and 'ö' for both 'ø' and 'ǫ'); the letter u is inserted between a consonant and r at the end of a word (maðr > maður); ss and rr at the end of a word are shortened (íss > ís, herr > her); and t and k at the end of a word in unstressed syllables are changed to ð and g (þat > það, ok > og). Furthermore, some inflectional endings were changed to their Modern Icelandic form.
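The spelling rules listed above can be illustrated with a small Python sketch. It is not the procedure used to prepare the corpus: it applies the rules naively, covers none of the changes to inflectional endings, and handles the "unstressed syllable" rule with a hypothetical lookup table (`UNSTRESSED_WORDS`) that contains only the two examples given in the text. The function name `normalize` is likewise an assumption made for illustration.

```python
import re
import unicodedata

# Vowel mergers named in the text: œ -> æ, ø and ǫ -> ö (upper case too).
VOWEL_MERGERS = str.maketrans({
    "œ": "æ", "Œ": "Æ",
    "ø": "ö", "Ø": "Ö",
    "ǫ": "ö", "Ǫ": "Ö",
})

# Assumption: stress cannot be read off the spelling, so this sketch uses a
# tiny lookup holding only the two examples given in the text.
UNSTRESSED_WORDS = {"þat": "það", "ok": "og"}

# Consonants that may precede a word-final r. The letter r itself is left
# out so that words like "herr" are handled by the rr rule instead.
CONSONANTS = "bdðfghjklmnpstvxzþ"


def _fix_unstressed(match):
    """Replace a whole word if it appears in the lookup table."""
    word = match.group(0)
    replacement = UNSTRESSED_WORDS.get(word.lower())
    if replacement is None:
        return word
    # Preserve an initial capital, e.g. "Ok" -> "Og".
    return replacement.capitalize() if word[0].isupper() else replacement


def normalize(text):
    """Apply the listed spelling rules, naively, to a string."""
    text = unicodedata.normalize("NFC", text)               # compose o + ogonek
    text = text.translate(VOWEL_MERGERS)                    # vowel mergers
    text = re.sub(r"([sr])\1\b", r"\1", text)               # íss > ís, herr > her
    text = re.sub(rf"([{CONSONANTS}])r\b", r"\1ur", text)   # maðr > maður
    return re.sub(r"\w+", _fix_unstressed, text)            # þat > það, ok > og
```

For example, `normalize("maðr ok herr")` returns `"maður og her"`. The final-geminate rule runs before u-insertion so that herr becomes her rather than picking up an inserted u.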

Tagging

The texts were first tagged with a method developed for Modern Icelandic. Methods for tagging Icelandic text have been developed using the tagged texts of the Icelandic Frequency Dictionary. The data-driven tagger TnT (Brants 2000) was trained on the tagged texts of the Icelandic Frequency Dictionary (Sigrún Helgadóttir 2004, 2007), and the resulting model was used to tag all the texts in the Saga Corpus.

To measure the tagging accuracy, four randomly selected samples of 1,000 words each were used: one from the Family Sagas, one from Heimskringla and two from Sturlunga Saga. The tags in these samples were corrected manually. Counting the correct tags in these samples gave a tagging accuracy of 88%, compared with 90.4% for the texts from the Icelandic Frequency Dictionary. The sentence structure of Old Icelandic is quite different from that of Modern Icelandic. Differences in word order should particularly affect the accuracy of a statistical tagger such as TnT, which is based on trigrams. However, sentences in Old Icelandic texts are generally very short, and short sentences are easier to analyze than long ones.

We then selected seven whole texts (sagas) and two fragments from the Sturlunga collection for manual correction, a total of 95,000 words. The TnT tagger was trained on these texts and the new model was used to tag the whole corpus. Accuracy was again measured on the four samples, giving 91.7%. Finally, the TnT tagger was trained on the union of the corrected Old Icelandic texts and the Modern Icelandic texts, and the Old Icelandic corpus was tagged using this model. Accuracy, measured in the same way as before, reached 92.7%.
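The accuracy figures above are simply the share of tokens in the manually corrected samples whose automatic tag matches the corrected tag. The following Python sketch shows that calculation, together with the relative error reduction implied by the reported 88% and 92.7%. The function names and the toy tag lists are invented for illustration and are not taken from the corpus or its tooling.

```python
def tagging_accuracy(predicted, gold):
    """Share of tokens whose predicted tag equals the manually corrected tag."""
    if len(predicted) != len(gold):
        raise ValueError("tag sequences must have the same length")
    if not gold:
        raise ValueError("cannot score an empty sample")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


def relative_error_reduction(old_accuracy, new_accuracy):
    """Fraction of the old model's errors that the new model removes."""
    old_error = 1.0 - old_accuracy
    new_error = 1.0 - new_accuracy
    return (old_error - new_error) / old_error


# Invented toy tags, only to show the call; not data from the corpus.
toy_gold = ["n", "v", "c", "n"]
toy_pred = ["n", "v", "a", "n"]
toy_score = tagging_accuracy(toy_pred, toy_gold)  # 3 of 4 tags match

# Accuracies reported in the text: 88% (Modern Icelandic model) and 92.7%
# (model trained on corrected Old Icelandic plus Modern Icelandic texts).
reduction = relative_error_reduction(0.88, 0.927)
```

With the reported figures the error rate falls from 12% to 7.3%, a relative reduction of roughly 39%.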

On this website the texts have been tagged and lemmatized in the same way as the texts in the MÍM Corpus.


Using the Saga Corpus
The texts of the Saga Corpus are available for use in two different ways:

1. Search. The search is available through the website of the Tagged Icelandic Corpus (MÍM) and works in the same way as the search in that corpus. Grammatical information can be used to refine the search. Bibliographic information is displayed for the texts that appear in the search results. Here is a list of the texts that can be searched.

2. Download. The texts are available in an XML format defined by the TEI (Text Encoding Initiative). Bibliographic information is included with all the texts. Prospective users must register and accept the terms and conditions. The texts are distributed under a CC BY 3.0 license.
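As a rough illustration of working with the downloaded files, the Python sketch below reads word tokens from a TEI-style XML fragment using only the standard library. The exact element and attribute names in the distributed files are not specified on this page; the sketch assumes word elements `w` in the TEI namespace carrying `lemma` and `type` (tag) attributes, and the sample fragment and its tag values are invented placeholders. Check the downloaded files and adjust the names accordingly.

```python
import xml.etree.ElementTree as ET

# The TEI P5 namespace; whether the corpus files declare it is an assumption.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical fragment: element names, attribute names and tag values are
# placeholders, not copied from the Saga Corpus files.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body><p>
    <s>
      <w lemma="maður" type="nken">maður</w>
      <w lemma="og" type="c">og</w>
      <w lemma="her" type="nken">her</w>
    </s>
  </p></body></text>
</TEI>"""


def read_tokens(xml_text):
    """Return (word form, lemma, tag) triples for every <w> element."""
    root = ET.fromstring(xml_text)
    return [
        (w.text or "", w.get("lemma"), w.get("type"))
        for w in root.iterfind(".//tei:w", TEI_NS)
    ]


tokens = read_tokens(SAMPLE)
```

For a file on disk, `ET.parse(path).getroot()` can take the place of `ET.fromstring`.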


Contact
Eiríkur Rögnvaldsson
professor, Icelandic linguistics
Faculty of Icelandic and Comparative Cultural Studies
School of Humanities, University of Iceland
Office: Árnagarði, 415
Work phone: +354-525-4403
Fax: +354-525-4242
E-mail: eirikur@hi.is; eirikur.rognvaldsson@gmail.com


References
Benediktsson, Jakob (ed.). 1968. Íslenzk fornrit I. Íslendingabók - Landnámabók. Hið íslenzka fornritafélag.

Brants, Thorsten. 2000. TnT - A Statistical Part-of-Speech Tagger. Proceedings of the Sixth Applied Natural Language Processing Conference ANLP-2000, pp. 224-231. Seattle, Washington, USA.

Halldórsson, Bragi, Jón Tómasson, Örnólfur Thorsson (eds.). 1985-1986. Íslendinga sögur. Svart á hvítu. Reykjavík.

Helgadóttir, Sigrún. 2004. Testing Data-Driven Learning Algorithms for PoS Tagging of Icelandic. In H. Holmboe (ed.): Nordisk Sprogteknologi. Museum Tusculanums Forlag.

Helgadóttir, Sigrún. 2007. Mörkun íslensks texta. Orð og tunga 9:75-107. Reykjavík.

Kristjánsdóttir, Bergljót, Bragi Halldórsson, G. Sigurðsson, Guðrún Ása Grímsdóttir, Guðrún Ingólfsdóttir, Jón Torfason, Sverrir Tómasson, Örnólfur Thorsson (eds.). 1988. Sturlunga saga. Svart á hvítu. Reykjavík.

Kristjánsdóttir, Bergljót, Bragi Halldórsson, Jón Torfason, Örnólfur Thorsson (eds.). 1991. Heimskringla. Mál og menning. Reykjavík.

Rögnvaldsson, Eiríkur and Sigrún Helgadóttir. 2011. Morphosyntactic Tagging of Old Icelandic Texts and Its Use in Studying Syntactic Variation and Change. In Sporleder, Caroline, Antal P.J. van den Bosch and Kalliopi A. Zervanou (eds.): Language Technology for Cultural Heritage: Selected Papers from the LaTeCH Workshop Series, pp. 63-76. Springer, Berlin.