June 14, 2017

«SCAP-TT: Tagging and lemmatising Spanish tourism discourse, and beyond»



Patrick Goethals, Els Lefever and Lieve Macken
«SCAP-TT: Tagging and lemmatising Spanish tourism discourse, and beyond»

Ibérica, 33 (Spring 2017)

Ibérica | AELFE (Asociación Europea de Lenguas para Fines Específicos, European Association of Languages for Specific Purposes) | Universitat Jaume I de Castelló | Facultat de Ciències Humanes i Socials | Departament d'Estudis Anglesos | Castellón | SPAIN


Excerpt from pages 279, 280-281 and 286-287 of the PDF publication. See the references in the original publication of the text.




«Abstract

»In this research note we report on the first results of SCAP, the Spanish Corpus Annotation Project, applied to tourism discourse. In particular, we present and assess a new TreeTagger parameter set for Spanish (SCAP-TT), which has been trained for the Part-of-Speech tagging (POS-tagging) and lemmatisation of Spanish promotional tourism texts. Although SCAP-TT has been trained for specialized tourism discourse, we also show promising results for the annotation of other text genres such as essays and literary texts.

»Keywords: POS-tagging, lemmatisation, Spanish, TreeTagger, tourism discourse, SCAP.


»Introduction

»This research was motivated by two observations. The first of them is that in Spanish specialized discourse corpus compilation projects, POS- and lemma-annotation are not yet self-evident features. Corpora often consist of raw text, allowing for word form-based queries, but not for more abstract POS- or lemma-based queries. Regarding Spanish tourism discourse, for example, the two main corpus projects, Linguaturismo (http://www.linguaturismo.it) and Cometval (http://www.uv.es/cometval) do not (yet) contain linguistic annotations. This observation is not intended as a criticism towards these specific projects, but rather as one example of a broader dichotomy between current practices in corpus and computational linguistics.

»The second observation is related to TreeTagger (TT, Schmid, 1994, 1995). TT is a tool for automatic POS-tagging and lemmatisation which predicts the most probable POS-tag for each word taking into account its inherent formal characteristics and the surrounding POS-context. TT can be run using the built-in parameters, but it also offers a training tool to generate new parameter sets, which means that it can be adapted and improved depending on the specific needs of a corpus project. Although the main architecture is language-independent, the output quality varies according to the language, since the tool depends on language-specific input, such as a lexicon, a tag set, a list of multi-word items or a training corpus (for technical details, see Schmid, 1994, 1995). It is generally accepted that the results for the Spanish TreeTagger are not as good as for English, for example (Göhring, 2009).

»Moreover, it should be noted that the adaptiveness of TreeTagger appears to be underused, at least for Spanish, since there are no newly trained and publicly available parameter sets for Spanish.

»Taking into consideration these observations, our aim is to use the inherent adaptiveness of TreeTagger and to make an improved parameter set for Spanish. In order to stimulate the development of annotated corpora, the parameter set is made available at the project’s website (www.scap.ugent.be).

»At the same website, readers will find further technical information, as well as advanced tools and automated applications for further processing the TT-output. In what follows, we will first briefly discuss the performance of the current Spanish TreeTagger parameter set (Standard-TT). Then, we will describe the main decisions that were taken in the development of a new parameter set (SCAP-TT), and compare the results of SCAP-TT with Standard-TT. Finally, it is important to emphasize that in this research note, we will not compare the results of TreeTagger with those of other tagging tools, such as IULA (Martínez et al., 2010), GRAMPAL (Moreno & Goni, 1995) or FREELING (Carreras et al., 2004) (see e.g. Parra & Martínez, 2015 for a recent comparison).


»Conclusion

»We have shown that SCAP-TT considerably improves the tagging and lemmatisation results of the current Spanish TreeTagger, especially but not exclusively for tourism discourse. We believe that this is an important contribution since it may reinforce the use of an already well accessible and well-known tool and, as such, contribute to integrating POS-tagging and lemmatisation into the current practice of Spanish corpus researchers. Unsurprisingly, we have also found that the new tagger gives the best results for the specific discourse domain for which it is trained.»
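
As a note outside the quoted excerpt: the difference the authors draw between word form-based queries and the more abstract POS- or lemma-based queries can be illustrated with a small sketch. It assumes tagger output in a tab-separated, three-column layout (word form, POS tag, lemma), which is the layout TreeTagger can emit. The sample sentence, the tag names (`DET`, `NOUN`, `VERB`, `ADJ`) and the helper functions are hypothetical placeholders chosen for illustration; they are not the SCAP-TT tag set or any tool from the project.

```python
from collections import namedtuple

# One annotated token: surface form, part-of-speech tag, lemma.
Token = namedtuple("Token", ["form", "pos", "lemma"])

# Hypothetical tagger output, one token per line: form<TAB>POS<TAB>lemma.
# The tags are illustrative placeholders, not the SCAP-TT tag set.
SAMPLE = (
    "Las\tDET\tel\n"
    "playas\tNOUN\tplaya\n"
    "ofrecen\tVERB\tofrecer\n"
    "una\tDET\tuno\n"
    "playa\tNOUN\tplaya\n"
    "tranquila\tADJ\ttranquilo\n"
)


def parse_tagged(text):
    """Parse three-column tagged text into Token tuples, skipping malformed lines."""
    tokens = []
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) == 3:
            tokens.append(Token(*parts))
    return tokens


def forms_by_lemma(tokens, lemma):
    """Return every surface form whose lemma matches (a lemma-based query)."""
    return [t.form for t in tokens if t.lemma == lemma]


def forms_by_pos(tokens, pos):
    """Return every surface form carrying the given POS tag (a POS-based query)."""
    return [t.form for t in tokens if t.pos == pos]


tokens = parse_tagged(SAMPLE)
print(forms_by_lemma(tokens, "playa"))  # ['playas', 'playa']
print(forms_by_pos(tokens, "DET"))      # ['Las', 'una']
```

A raw-text corpus would need a separate search for each inflected form ("playa", "playas"), whereas the lemma column retrieves both with a single query.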




