Taylor Mahler, Willy Cheung, Micha Elsner, David King, Marie-Catherine de Marneffe, Cory Shain, Symon Stevens-Guille and Michael White
«Breaking NLP: Using Morphosyntax, Semantics, Pragmatics and World Knowledge to Fool Sentiment Analysis Systems»
Proceedings of the First Workshop on Building Linguistically Generalizable NLP Systems, Copenhagen, Denmark, September 8, 2017, Association for Computational Linguistics.
Excerpt from the Introduction and Discussion sections, on pages 33 and 39 of the PDF publication. See the original publication for the references cited in the text.
«Introduction
»This paper describes our submission to the 2017 EMNLP “Build It Break It” [EMNLP 2017: Conference on Empirical Methods in Natural Language Processing, September 7–11, 2017, Copenhagen, Denmark] shared task on sentiment analysis, in which we constructed minimal pairs of sentences designed to fool the sentiment analysis systems participating in the task. One member of each pair existed in the blind test data; the other was a minimally edited version of the first, designed to cause the systems to make an incorrect prediction on exactly one of the two.
»The edits were made according to four broad, linguistically interpretable strategies: altering syntactic or morphological structure, changing the semantics of the sentence, exploiting pragmatic principles, and including content that can only be understood with sufficient world knowledge. Some of our changes were designed to fool bag-of-words models, while others used more complex structures to try to fool more sophisticated systems relying on parsing and/or compositional methods. Our submitted pairs broke the builder systems at a high average rate (72.6%), and our overall weighted F1 score as defined by the shared task (28.67) puts us in second place out of the four breaker submissions.
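[Editorial note: the following sketch is not from the paper; it illustrates the break criterion described above with a hypothetical lexicon-based classifier and invented example sentences. A pair "breaks" a system when the system predicts correctly on exactly one of its two members.]

# Minimal sketch (not from the paper) of the pair-breaking criterion.
from typing import Callable

def breaks_system(predict: Callable[[str], str],
                  original: str, edited: str,
                  gold_original: str, gold_edited: str) -> bool:
    """True iff the system is correct on exactly one member of the pair."""
    correct_orig = predict(original) == gold_original
    correct_edit = predict(edited) == gold_edited
    return correct_orig != correct_edit  # XOR: exactly one correct

# Toy lexicon-based stand-in for a bag-of-words sentiment system
# (hypothetical; the shared-task builder systems were far richer).
POSITIVE = {"great", "riveting", "charming"}
NEGATIVE = {"dull", "tedious"}

def toy_predict(sentence: str) -> str:
    tokens = [t.strip(".,!?") for t in sentence.lower().split()]
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    return "positive" if score >= 0 else "negative"

# A single morphological edit flips the true label; the edited word falls
# outside the toy lexicon, so the bag-of-words stand-in misses the flip.
print(breaks_system(toy_predict,
                    "A charming little film.", "A charmless little film.",
                    "positive", "negative"))  # True: the pair breaks it

Here the morphological edit ("charming" to "charmless") reverses the true label, but the edited word is absent from the toy lexicon, so the stand-in keeps predicting "positive" and the pair counts as a break. The paper's other strategies target the analogous gaps in systems that do use parsing or compositional methods.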
»Discussion
»Our results, and those of the shared task in general, serve to highlight the distance which even sophisticated, modern sentiment analysis systems have yet to cover, particularly in terms of semantic and pragmatic analysis. Moreover, changes that broke the systems were often comparatively slight; just as image classification systems can be vulnerable to adversarial examples that look very similar to the originals (?), sentiment analysis systems may be fooled by changes to single words or morphemes.
»In many cases, of course, our strategies for constructing these examples drew on previous knowledge about hard problems, for instance in parsing (?) and the detection of irony in text (?). Nonetheless, a concrete set of examples of these problems may help developers to create more robust systems in the future.
»For sets of constructed examples like ours to be useful, they should contain enough instances of each construction to reliably indicate a system’s capabilities. Looking towards the future, we hope that the next iteration of the contest will use a larger test section so that more examples can be created. Many of our strategies targeted particular constructions or idioms (for instance, right-node raising or concrete metaphors), and it was difficult to create many instances of these due to sparsity in the 521-example dataset. We found it difficult to create 100 examples as requested; in fact, two other breaker teams (including the one with the winning F-score) created only half as many.
»A related issue is that of naturalness. Although we tried to make our examples sound like real sentences from movie reviews, we had no empirical way to check how well we did. It is probably easier to break NLP algorithms with unnatural or out-of-domain examples; although we hope we have not done so, in the future we would like to find better ways to make sure.»
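[Editorial note: the paper leaves open how naturalness might be checked empirically. One plausible screen, purely illustrative and not part of the original work, is per-token perplexity under a pretrained language model; the sketch below assumes the Hugging Face transformers library and the public GPT-2 checkpoint.]

# Hypothetical naturalness screen (not from the paper): score candidate
# edited sentences with a pretrained language model and flag those whose
# perplexity is far above typical in-domain text.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(sentence: str) -> float:
    """Per-token perplexity of `sentence` under GPT-2."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean cross-entropy per token
    return float(torch.exp(loss))

# The scrambled variant should score far worse than the fluent one.
print(perplexity("The movie was surprisingly good."))
print(perplexity("Movie the good surprisingly was."))

An edited sentence whose perplexity is far above its original's, or above typical review text, would be a candidate for rewriting before submission.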