Research data supporting “Source Sentence Simplification for Statistical Machine Translation”

Name: Research data supporting “Source Sentence Simplification for Statistical Machine Translation”
Published: 2016-10-11T16:04:35Z
Keywords: text simplification, machine translation

Hasler, Eva; de Gispert, Adrià; Stahlberg, Felix; Waite, Aurelien; Byrne, Bill

doi:10.17863/CAM.5868

Repository URI

https://www.repository.cam.ac.uk/handle/1810/260714

Repository DOI

https://doi.org/10.17863/CAM.5868

Files

simplification.en-de.tar.gz (380.59 KB)

Type

Dataset

Description

This data set contains subsets of English-German test sets from the Workshop for Machine Translation (WMT) which have been annotated with manual text simplification information on the source side in the form of gap begin and gap end symbols (, ). The data was tokenized and truecased using the processing scripts distributed with the Moses SMT system. The source simplifications were produced by workers recruited on the crowdsourcing platform Crowdflower (https://www.crowdflower.com). We asked workers to simplify a sentence by deleting words and punctuation, while trying to retain the most important information in the shortened sentence. Their performance was controlled using test questions and a second Crowdflower task which asked workers to identify bad simplifications from the first task. The outcomes of the second task were aggregated by combining an agreement score and the average worker trust score for each simplification. We selected randomly from the remaining simplifications with a combined score of at least 0.5.
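As a rough illustration of how the gap annotation might be consumed, the sketch below splits a tokenized sentence into the tokens outside and inside gap spans. It rests on two assumptions that should be checked against the files themselves: the marker strings `GAP_BEGIN` and `GAP_END` are hypothetical placeholders (the actual symbols are not shown in this record), and the tokens between a gap begin and a gap end symbol are taken to be the ones the workers deleted.

```python
from typing import List, Tuple

# Hypothetical placeholder markers: the real gap symbols are not shown in this
# record, so inspect the data files and substitute the actual tokens.
GAP_BEGIN = "GAP_BEGIN"
GAP_END = "GAP_END"


def split_on_gaps(
    sentence: str, gap_begin: str = GAP_BEGIN, gap_end: str = GAP_END
) -> Tuple[List[str], List[str]]:
    """Return (tokens outside gaps, tokens inside gaps).

    The sentence is assumed to be whitespace-tokenized, with each gap marker
    appearing as a token of its own.
    """
    outside: List[str] = []
    inside: List[str] = []
    in_gap = False
    for token in sentence.split():
        if token == gap_begin:
            if in_gap:
                raise ValueError("gap begin marker inside an open gap")
            in_gap = True
        elif token == gap_end:
            if not in_gap:
                raise ValueError("gap end marker without a matching begin")
            in_gap = False
        elif in_gap:
            inside.append(token)
        else:
            outside.append(token)
    if in_gap:
        raise ValueError("gap begin marker was never closed")
    return outside, inside


if __name__ == "__main__":
    kept, removed = split_on_gaps(
        "the cat GAP_BEGIN , which was old , GAP_END slept ."
    )
    print(" ".join(kept))     # simplified sentence under the stated assumption
    print(" ".join(removed))  # tokens marked as deleted
```

Passing the markers as parameters keeps the function usable once the real symbols are known, without editing its body.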

Software / Usage instructions

text editor
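
Before opening the files in a text editor, it may help to see what the archive contains. The following is a minimal sketch using Python's standard `tarfile` module. It assumes that `simplification.en-de.tar.gz` has already been downloaded to the working directory. The helper name `list_archive_members` is introduced here for illustration only, and the layout of the archive is not documented in this record.

```python
import tarfile
from typing import List


def list_archive_members(archive_path: str) -> List[str]:
    """Return the member names of a gzip-compressed tar archive."""
    with tarfile.open(archive_path, mode="r:gz") as archive:
        return archive.getnames()


if __name__ == "__main__":
    # Assumed local path to the downloaded file listed under "Files".
    for name in list_archive_members("simplification.en-de.tar.gz"):
        print(name)
```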

Keywords

text simplification, machine translation

Rights and licensing

Except where otherwise noted, this item's license is described as Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)

Sponsorship

Engineering and Physical Sciences Research Council (EP/L027623/1)

Relationships

Supplements:

https://doi.org/10.1016/j.csl.2016.12.001

Collections

Research Data - Engineering
Symplectic mapped items for data match
