ParIce is a parallel corpus of Icelandic and English texts, containing 3.589.052 sentence pairs from 11 different text collections. The texts were aligned at the sentence level and the word level. The texts have also been tagged and lemmatized.






Using ParIce

• Search ParIce.
• Download ParIce (under a CC BY licence).

About ParIce
The texts
ParIce contains texts in Icelandic and English where sentences, or parts of sentences, in both languages have been aligned. In total the corpus contains 3.589.052 sentence pairs. The Icelandic text contains 46.727.741 tokens.

The texts come from eleven different text collections. Most were taken from existing parallel corpora (Opus, Tilde, ELRC) or retrieved from websites. Further information about the origin of the texts can be found here.

Text Lines
The Bible 65.241
EEA documents 1.701.172
Package leaflets (EMA) 404.333
Press releases from European Southern Observatory (ESO) 12.633
Statistics Iceland - from website 2.288
Icelandic Sagas 17.597
KDE4 49.909
Classical literature 12.416
OpenSubtitles 1.304.628
Tatoeba 8.263
Ubuntu 10.572
Total 3.589.052


Alignment, filtering, tagging and lemmatization
Information about the processing of the texts, i.e. alignment, filtering, tagging and lemmatization, can be found in an article by Starkaður Barkarson and Steinþór Steingrímsson (2019).

Project manager
Steinþór Steingrímsson

Software development
Rose Costa
Starkaður Barkarson



Using the corpus
All texts (except those originating from http://opensubtitles.org, see below) are available in two ways:

1. Search. The corpus is available for search. It is possible to search for one or more words or lemmas and to use tags (linguistic annotation) to narrow the search. The search interface is based on the Swedish search interface Korp.

2. Download. The texts, excluding those originating from http://opensubtitles.org, are available in TMX format, where the sentences have been paired, and in a special XML format, TEI P5, defined by the TEI (Text Encoding Initiative), where the texts have been tokenized, tagged and lemmatized. The XML files contain information that can be used to align both sentences and words between the two languages. All users are registered with their e-mail address when they accept the user licence.
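For readers who want to work with the downloaded TMX files, the following is a minimal sketch of how sentence pairs could be read with Python's standard library. It relies only on the general TMX layout (translation units `tu` containing language variants `tuv`, each with a `seg` element). The function name `read_tmx_pairs` and the language codes `en` and `is` are assumptions made for illustration; check the downloaded files for the codes they actually use.

```python
import io
import xml.etree.ElementTree as ET

# The xml:lang attribute lives in the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_tmx_pairs(source, src_lang="en", tgt_lang="is"):
    """Yield (source, target) segment pairs from a TMX file or file-like object.

    The language codes are assumptions; adjust them to match the files.
    """
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "tu":
            continue
        segments = {}
        for tuv in elem.iter("tuv"):
            # TMX 1.4 uses xml:lang; older files may use a plain lang attribute.
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also collects text nested inside inline markup.
                text = "".join(seg.itertext()).strip()
                # Reduce codes such as "en-GB" to their primary subtag "en".
                segments[lang.split("-")[0]] = text
        if src_lang in segments and tgt_lang in segments:
            yield segments[src_lang], segments[tgt_lang]
        elem.clear()  # release the processed unit; the files can be large


# Small in-memory sample standing in for a downloaded file (hypothetical data).
sample = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<tmx version="1.4"><header/><body>'
    '<tu><tuv xml:lang="en"><seg>Good morning.</seg></tuv>'
    '<tuv xml:lang="is"><seg>Góðan daginn.</seg></tuv></tu>'
    '<tu><tuv xml:lang="en"><seg>No Icelandic side here.</seg></tuv></tu>'
    '</body></tmx>'
).encode("utf-8")

pairs = list(read_tmx_pairs(io.BytesIO(sample)))
for english, icelandic in pairs:
    print(english, "|", icelandic)
```

The sketch reads the file incrementally with `iterparse` rather than loading it whole, because some collections contain over a million pairs. Translation units that lack one of the two languages are skipped.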

Texts from http://opensubtitles.org have to be downloaded from http://opus.nlpl.eu. A special script is used to align the sentences. The script, as well as further information about its usage, is included in the package.


When publishing results based on the texts in ParIce, please refer to: Starkaður Barkarson, Steinþór Steingrímsson. 2019. Compiling and Filtering ParIce: An English-Icelandic Parallel Corpus. In Proceedings of the 22nd Nordic Conference on Computational Linguistics, Turku, Finland.


Contact


References
Starkaður Barkarson, Steinþór Steingrímsson. 2019. Compiling and Filtering ParIce: An English-Icelandic Parallel Corpus. In Proceedings of the 22nd Nordic Conference on Computational Linguistics, Turku, Finland.