Jannis Vamvas

An Encoder Model for Swiss German

2024-01-23T00:00:00+01:00

Last year, I announced SwissBERT, a multilingual encoder model that we trained on news articles from Switzerland.

At the time, we found that SwissBERT had good accuracy on Switzerland-related NLP tasks such as named entity recognition and stance detection, compared to similar models that were not trained on those data. We were especially impressed by the model's performance on Romansh input text, for which we had little training data and for which no previous language model existed.*

A limitation of SwissBERT is that it has only been trained on news articles. In Switzerland, there is a stark difference between Standard German, as used in newspapers, and Swiss German dialect. Dialect is not traditionally written, but has become ubiquitous on social media, in text messages and other informal contexts:

Super happy für min @FC_Basel 3-0 super Resultat! Das git mir so richtig lust uf min match morn im halbfinal!!! #Dangge
— Roger Federer (@rogerfederer) February 27, 2014

An ideal encoder model for Switzerland should thus be able to process written text in both Standard German and Swiss German (in addition to Romansh, French, and Italian).

Adding Swiss German to SwissBERT

In a new paper that we present at the MOOMIN Workshop on Modular and Open Multilingual NLP, we propose an updated version of SwissBERT that can do just that. We trained the model on two new datasets:

SwissCrawl (Linder et al., 2020), a collection of Swiss German web text (forum discussions, social media).
A dataset of Swiss German tweets that I collected during my master studies at LMU Munich.

We evaluated the model on three Swiss German tasks and found that adding Swiss German to the training data generally leads to a clear improvement in accuracy:

The updated model is available on the Hugging Face hub. Note that due to the data licenses, use of the model is restricted to research purposes.

Modular Adaptation through Adapters

In an earlier post, I described the modular architecture that we used for SwissBERT, called X-MOD (Pfeiffer et al., 2022). The idea is to have a single encoder model that can be adapted to different languages by adding a language adapter module for each language. The adapter is activated only when the model is processing input in the given language.

The original SwissBERT model has four language adapters for the four national languages of Switzerland (de_CH = Swiss Standard German, fr_CH = French, it_CH = Italian, rm_CH = Romansh Grischun). Adding Swiss German to the model is straightforward: We simply add a fifth adapter for Swiss German (gsw). All the other modules stay exactly the same:

We compared this strategy to a baseline where we update the Transformer layers and embeddings as well, when we train on Swiss German. We found that our modular approach reaches 97.5% of the accuracy of the baseline, while the multilinguality of the model is guaranteed to be preserved.

More Findings 🤓

In the paper, we perform some more experiments – beyond SwissBERT – which are especially relevant if you're planning to train your own model on Swiss German. Here's the highlights:

XLM-R is a surprisingly good baseline. Obviously, the multilingual model XLM-R (Conneau et al., 2020) does not work well with Swiss German, because it was never trained on text in this language. However, if we just continue the pre-training of XLM-R on our Swiss German dataset, the average accuracy is actually higher than that of the adapted SwissBERT. We release our Swiss German XLM-R model here.
Character-level modeling is good for cross-lingual retrieval. The spelling of Swiss German text is highly variable, which motivated us to also try adapting a multilingual character-level model. We adapted CANINE (Clark et al., 2022) to Swiss German and found that on part-of-speech tagging, it achieves a much lower accuracy than the subword-based XLM-R and SwissBERT models. But CANINE achieves the best accuracy on the sentence retrieval task, which to us was surprising and could be an inspiration for future work. The Swiss German CANINE model is available here.
A custom Swiss German subword vocabulary is not beneficial. Given the spelling differences between Standard German and Swiss German, one might think that a custom subword vocabulary is needed for Swiss German. However, we found that re-using the existing vocabularies of XLM-R and SwissBERT worked better. In addition to the spelling variation in Switzerland, which might make compression harder, another likely reason is that the Swiss German dataset is relatively small, so the model might not be able to learn good word embeddings from scratch.

Equipped with a Swiss German adapter, the SwissBERT model is now a more complete text encoder and covers not only the four national languages of Switzerland, but also a family of dialects that is spoken (and written) by 5 million people in Switzerland.

*You might ask: Couldn't ChatGPT do these tasks as well? Yes, maybe it could. But an encoder like SwissBERT is a lot smaller, and, as of today, much faster. One use case where efficiency is important is processing a large document collection, e.g., for creating an embedding-based search index. Such an index is often needed for enabling retrieval augmentation of large language models.

The work presented in this post is joint work with Noëmi Aepli and Rico Sennrich. It was funded by the Swiss National Science Foundation (project nos. 213976 and 191934).

References

Jonathan H. Clark, Dan Garrette, Iulia Turc, and John Wieting. Canine: Pre-training an Efficient Tokenization-Free Encoder for Language Representation. Transactions of the Association for Computational Linguistics, 10:73–91, 01 2022. URL: https://doi.org/10.1162/tacl\_a\_00448, arXiv:https://direct.mit.edu/tacl/article-pdf/doi/10.1162/tacl\_a\_00448/1985933/tacl\_a\_00448.pdf, doi:10.1162/tacl_a_00448. ↩

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. Unsupervised cross-lingual representation learning at scale. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault, editors, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 8440–8451. Online, July 2020. Association for Computational Linguistics. URL: https://aclanthology.org/2020.acl-main.747, doi:10.18653/v1/2020.acl-main.747. ↩

Lucy Linder, Michael Jungo, Jean Hennebert, Claudiu Cristian Musat, and Andreas Fischer. Automatic creation of text corpora for low-resource languages from the internet: the case of swiss german. In Proceedings of The 12th Language Resources and Evaluation Conference, 2706–2711. Marseille, France, May 2020. European Language Resources Association. URL: https://www.aclweb.org/anthology/2020.lrec-1.329. ↩

Jonas Pfeiffer, Naman Goyal, Xi Lin, Xian Li, James Cross, Sebastian Riedel, and Mikel Artetxe. Lifting the curse of multilinguality by pre-training modular transformers. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 3479–3495. Seattle, United States, July 2022. Association for Computational Linguistics. URL: https://aclanthology.org/2022.naacl-main.255, doi:10.18653/v1/2022.naacl-main.255. ↩

Wenn ChatGPT den Smartvote-Fragebogen ausfüllt

2023-08-23T00:00:00+02:00

Mit der zunehmenden Nutzung von Sprachmodellen im Alltag wird auch die Frage relevant, ob diese eine politische Schlagseite haben. Die Tamedia-Zeitungen haben Anfang der Woche über eine Studie berichtet, die genau das untersucht hat. In dieser Studie wurde ein englischsprachiger Fragebogen (Political Compass) verwendet, um Sprachmodelle auf ihre politische Ausrichtung zu testen.

Ein Ergebnis war, dass ChatGPT tendenziell links der Mitte zu verorten ist. Allerdings ist der verwendete Fragebogen – wie vieles in der Politik – nicht ganz unumstritten. In der Schweiz gibt es mit Smartvote seit 20 Jahren einen Fragebogen, der sich bewährt hat, und heute wurde eine neue Version für die eidgenössischen Wahlen im Oktober 2023 aufgeschaltet. Jetzt bietet es sich an, ChatGPT auch diesen Fragebogen ausfüllen zu lassen.

Der neue Fragebogen vom Smartvote enthält 75 Fragen, von Umwelt und Energie über gesellschaftliche Fragen bis zum Bundeshaushalt. Smartvote erstellt anhand des Fragebogens eine Grafik mit grossem Wiedererkennungswert, genannt Smartspider. Eine weitere Besonderheit von Smartvote ist die Mehrsprachigkeit. Diese hat mich vor drei Jahren schon zu einem anderen computerlinguistischen Experiment inspiriert.

Um ChatGPT den Smartvote-Fragebogen beantworten zu lassen, muss man kein Experte sein. Die Idee ist auch nicht komplett neu. In diesem Blogpost möchte ich die Idee aber etwas genauer anschauen:

Ich vergleiche ChatGPT mit dem aktuell grössten Open-Source-Sprachmodell (LLaMA 2), welches weniger einfach zugänglich ist als ChatGPT.
Ich stelle die Fragen in allen vier Landessprachen, plus Englisch. Interessanterweise hat die Sprache einen Einfluss auf das Resultat – was verschiedene Gründe haben könnte.
Ich erkläre, warum mehrfaches Generieren von Antworten oder eine Analyse der Wortwahrscheinlichkeiten wichtige Methoden sind, um die Tendenzen eines Sprachmodells zu erfassen.

ChatGPT vs. LLaMA 2

ChatGPT ist ein kommerzielles Sprachmodell von OpenAI, das von registrierten Nutzer*innen kostenlos benutzt werden kann. Leider sind die technischen Details von ChatGPT nicht öffentlich zugänglich. Darum ist es nützlich, auch Open-Source-Modelle zu haben, die man genauer unter die Lupe nehmen kann.

Ein solches Open-Source-Modell ist LLaMa 2 von Meta. Ich lege den Smartvote-Fragebogen sowohl ChatGPT als auch LLaMa 2 vor, und zwar der grössten Version von LLaMa mit 70 Milliarden Parametern, welche für Chat-Applikationen optimiert wurde.

Ein erster Vergleich zeigt, dass beide Modelle eine ähnliche Smartspider-Grafik erhalten:

Beide Spiders sind stark eingemittet. Das muss aber nicht heissen, dass die Modelle in ihren Antworten übereinstimmen: Nur auf 38 von 75 Fragen geben sie die gleiche Antwort. Vielmehr neigen die Modelle dazu, gemässigte Antworten zu geben ("eher ja", "eher nein"):

Einen ausgeprägten Smartspider erhält aber nur, wer hin und wieder auch entschiedene Antworten gibt ("ja", "nein").

Spielt die Sprache eine Rolle?

Bis jetzt habe ich die Fragen nur auf Deutsch gestellt. Sehen die Smartspiders auch gleich aus, wenn die Fragen auf Französisch, Italienisch oder Rumantsch gestellt werden, oder auf Englisch? Für diese Sprachen stellt Smartvote eine Übersetzung der Fragen und Antwortoptionen zur Verfügung. Wenn man diese dem Sprachmodell vorlegt, zeigt sich, dass zum Teil recht verschiedene Smartspiders herauskommen:

Der erste Gedanke ist sicher, dass die Sprache und damit auch der Kulturraum einen Einfluss auf die Antworten des Sprachmodells hat. Oft spricht man in der Schweiz vom Rösti- und Polentagraben, der die verschiedenen Sprachregionen auch politisch trennt. Konkret könnte ich mir vorstellen, dass italienischsprachige Texte, auf denen ChatGPT trainiert wurde, im Durchschnitt leicht andere Meinungen propagieren als etwa Texte in deutscher Sprache.

Allerdings kann ich mir auch einen profaneren Grund vorstellen: Zufall. Vielleicht werden die Sprachmodelle von Oberflächlichkeiten beeinflusst – zum Beispiel von der Länge der Fragen, der Reihenfolge der Antwortoptionen und von der Wortwahl. Eine solche fehlende Robustheit ist in der Computerlinguistik oft zu beobachten.

Um das zu überprüfen, könnte man die Fragen umformulieren und schauen, ob das Modell dann eine andere Meinung ausgibt. Das wäre ein interessantes Experiment für die Zukunft. Würde sich ein Mangel an Robustheit bestätigen, dann wäre es in meinen Augen nicht mehr korrekt, von der "Meinung" oder von der "politischen Ausrichtung" eines Sprachmodells zu sprechen.

Stattdessen wäre der Smartspider von ChatGPT einfach ein Zufallsprodukt, das Ergebnis eines Prozesses, der gar nichts mit Politik zu tun hat. Trotzdem können Sprachmodelle eine Bias oder Voreingenommenheit aufweisen. Nur könnte man diese nicht anhand eines politischen Fragebogens quantifizieren.

Zur Methodik

Grundsätzlich kann man eine Frage einfach ins Chatfenster eingeben und versuchen, die Antwort in das Schema "ja", "eher ja", "eher nein", "nein" einzuordnen:

Dieser Ansatz hat aber zwei Nachteile:

Oft verweigert das Sprachmodell eine entschiedene Antwort – schliesslich wurde es vom Hersteller dazu trainiert, bei heiklen Fragen auf der sicheren Seite zu bleiben.
Die Antwort, die wir bekommen, ist für das Modell gar nicht unbedingt die allerwahrscheinlichste Antwort. Stattdessen wird die Antwort durch eine Stichprobe erzeugt, bei der auch ein bisschen Zufall im Spiel ist. Wenn man die Frage noch einmal stellt, bekommt man vielleicht die gegenteilige Antwort.

Die Methode, die ich für diesen Blogpost gewählt habe, ist daher etwas komplizierter. Ich weise das Sprachmodell an, entweder mit A, B, C oder D zu antworten. Ich teile dem Modell mit, dass A "ja" heisst, B "eher ja" heisst, etc. Dann vergleiche ich, welche Wahrscheinlichkeiten das Modell den Wörtern zuweist: Ist "A", "B", "C" oder "D" am wahrscheinlichsten als nächstes Wort? Das kann man nicht via das Chatfenster machen, sondern man braucht Programmcode wie zum Beispiel das Projekt LMQL von der ETH Zürich.

Man könnte es auch so formulieren, dass ich das Sprachmodell zwinge, eine der vorgegebenen Antworten zu generieren. Andere Antworten als A, B, C oder D sind nicht erlaubt (forced choice).

Im Fall von ChatGPT ist es noch nicht möglich, die Wahrscheinlichkeiten der einzelnen Wörter zu analysieren. Dafür ist die API von OpenAI im Moment nicht geeignet. Ein äquivalenter Ansatz ist es, eine Antwort viele Male neu generieren zu lassen. Wenn man dann zählt, wie oft "A", "B", "C" und "D" herauskommt, kann man die zugrundeliegenden Wahrscheinlichkeiten abschätzen. Das habe ich für ChatGPT so gemacht.

FAQ

Wie lauten die Fragen genau?

Der Fragebogen ist auf der Website von Smartvote verfügbar.

Wie wird der Smartspider berechnet?

Der Smartspider wurde von Smartvote so definiert, dass ein Teil der Fragen einer oder mehreren Achsen zugeordnet wird. Die genaue Zuteilung ist in einem PDF dokumentiert.

My PhD Thesis Is Out! (A Summary)

2023-04-05T00:00:00+02:00

After quite a few months of writing and polishing, my PhD thesis is now available on the e-print repository of my university. And a few weeks ago, I have successfully defended the thesis in an oral examination:

Congratulations to @j_vamvas , who just passed his viva on "Model-based Evaluation of Multilinguality"! With thanks to the examiners @unattributed and @LenaAJaeger . pic.twitter.com/vhwp8CtTWy
— Zurich Computational Linguistics Group (@cl_uzh) March 13, 2023

My thesis combines four research papers published in 2021 and 2022. All the papers were co-authored with my main supervisor, Rico Sennrich. Earlier, I have summarized the papers on this blog: 1, 2a and 2b, 3, and 4.

What I added to the thesis was a 40-page introduction with some additional context. The introduction characterizes the key problem: how to evaluate modern natural language processing (NLP) systems in multiple languages. Many of these systems – like GPT, DeepL or Google Translate – are designed to handle multilingual input. Evaluation means figuring out how well the systems do in comparison to each other and to humans.

Good evaluation practice is necessary for multiple reasons: for research and development (doing experiments and measuring the effect of changes), for real-world applications (ensuring safety) and for society at large (understanding when NLP systems are working well and when they are failing). But multilinguality remains a great challenge, mostly because there are so many languages in the world, but also because resources, including human resources, are not equally available for all languages.

Most contributions in the thesis focus on targeted evaluation, where a specific aspect of system quality is examined. There is a lot of previous work on targeted evaluation; for example, the WinoMT challenge (Stanovsky et al., 2019) specifically looked into occupation nouns like ‘doctor’ and their translation in terms of gender. Current machine translation (MT) systems have an overgeneralization bias and tend to resort to whatever has been the majority gender in the training data, often ignoring the gender information in the input sentence:

An English–German example for WinoMT. I have annotated the gender of ‘doctor’ and its translations with emoji. Note that this is not a shortcoming of DeepL in particular – other MT systems tend to do similarly bad.

But the idea behind WinoMT poses a challenge to evaluation as well: How can we account for the fact that there are many good translations of these sentences, and still reliably judge whether the occupation noun has been correctly translated in terms of gender? My thesis discusses different methods for targeted evaluation and presents novel experiments that highlight limitations in previous methods (Vamvas and Sennrich, 2021). We then propose a new model-based targeted evaluation method called Contrastive Conditioning.

The idea behind Contrastive Conditioning is to classify the machine translation using another MT system. We can delegate the evaluation process to that system if we provide it with extra information via an augmented source sequence:

Model-based evaluation allows us to scale the evaluation across multiple target languages, without having to create custom test sets for every target language. Another advantage of our method is that anyone can automatically analyze the translations from black-box systems like DeepL. Our method does not require access to the system that has generated the machine translations.

In the thesis, we demonstrate how the method can be applied to the problems of (1) measuring lexical overgeneralization bias in MT (like WinoMT) and showing that distilled translation models overgeneralize more strongly (Vamvas and Sennrich, 2021), and (2) detecting coverage errors in MT, e.g., detecting whether information has been lost in translation (Vamvas and Sennrich, 2022).

A recurring idea in the thesis is that MT systems can be useful beyond translation: as a model of multilinguality and of semantic equivalence across languages. A perfect example is the method described above, since Contrastive Conditioning makes use of MT to automate targeted evaluation. But in our last paper, we looked at the idea from a different angle and demonstrated different ways of how an MT system can be queried for estimating semantic similarity (Vamvas and Sennrich, 2022).

Specifically, MT systems can judge whether two sentences are paraphrases of each other (Thompson and Post, 2020). This is especially interesting if phrases seem to look similar but have an opposite meaning, as in “Flights from New York to Florida” vs. “Flights from Florida to New York”. In that case, we showed that MT-based approaches, such as our proposed translation cross-likelihood measure, are much more accurate than alternative approaches:

Accuracy of different text similarity measures on paraphrase identification (on average across test sets in 9 languages). The accuracy of translation-based measures can be increased by applying a normalization (dark red), such as reconstruction normalization.

Overall, I see my thesis as a contribution towards extending the range of technical possibilities in multilingual NLP evaluation. But there are still many limitations, including fundamental ones. Seventy years ago, Warren Weaver (1894–1978), an influential technologist at the dawn of the computer age, put forward four principles that he saw as crucial for advancing NLP (Weaver, 1952). In the introduction to my thesis, I revisit his memorandum and find that three of his principles have been realized by now, in some form or other. The principles envisioned by Weaver – contextualization, machine learning and information theory – are now closely reflected in the state of the art of NLP.

However, his memorandum concludes with a fourth principle: multilinguality. And while this idea has inspired research ever since, the terms in which modern NLP systems can be understood as being truly multilingual are still unclear. Common to the methods presented in this thesis is a functionalist approach – crafting inputs and observing the outputs of NLP systems. To bring Weaver’s Fourth Principle to fruition and to verify its success, a functionalist approach might not be enough for NLP.

The work presented in this post was funded by the Swiss National Science Foundation (project MUTAMUR; no. 176727). I thank the members of the supervisory and doctoral committees for their valuable feedback.

References

Gabriel Stanovsky, Noah A. Smith, and Luke Zettlemoyer. Evaluating gender bias in machine translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 1679–1684. Florence, Italy, July 2019. Association for Computational Linguistics. URL: https://aclanthology.org/P19-1164, doi:10.18653/v1/P19-1164. ↩

Brian Thompson and Matt Post. Automatic machine translation evaluation in many languages via zero-shot paraphrasing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 90–121. Online, November 2020. Association for Computational Linguistics. URL: https://aclanthology.org/2020.emnlp-main.8, doi:10.18653/v1/2020.emnlp-main.8. ↩

Jannis Vamvas and Rico Sennrich. Contrastive conditioning for assessing disambiguation in MT: A case study of distilled bias. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 10246–10265. Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. URL: https://aclanthology.org/2021.emnlp-main.803, doi:10.18653/v1/2021.emnlp-main.803. ↩

Jannis Vamvas and Rico Sennrich. On the limits of minimal pairs in contrastive evaluation. In Proceedings of the Fourth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, 58–68. Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. URL: https://aclanthology.org/2021.blackboxnlp-1.5, doi:10.18653/v1/2021.blackboxnlp-1.5. ↩

Jannis Vamvas and Rico Sennrich. As little as possible, as much as necessary: detecting over- and undertranslations with contrastive conditioning. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 490–500. Dublin, Ireland, May 2022. Association for Computational Linguistics. URL: https://aclanthology.org/2022.acl-short.53. ↩

Jannis Vamvas and Rico Sennrich. NMTScore: a multilingual analysis of translation-based text similarity measures. In Findings of the Association for Computational Linguistics: EMNLP 2022, 198–213. Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics. URL: https://aclanthology.org/2022.findings-emnlp.15. ↩

Warren Weaver. Translation. In Proceedings of the Conference on Mechanical Translation. Massachusetts Institute of Technology, 17-20 June 1952. URL: https://aclanthology.org/1952.earlymt-1.1. ↩

Introducing SwissBERT

2023-03-24T00:00:00+01:00

Self-supervised text encoders such as BERT are an important tool for natural language processing (NLP) applications. You won’t see them chatting away on the web like some large-scale generative models. But they are useful for NLP practitioners because they can be trained with mere billions, rather than trillions, of words, and they lend themselves to supervised fine-tuning.

After the original BERT model was released for English, many others were created, like CamemBERT for French and GilBERTo for Italian. Now, a team at the University of Zurich is adding another one to the list: We release SwissBERT, the multilingual language model for Switzerland.

SwissBERT supports Swiss Standard German, French, Italian and Romansh Grischun. It might even be extended to Swiss German dialects in the future.

Why wasn’t there a unified model for the Swiss national languages until now? A reason is that Switzerland is a multilingual country, and multilinguality is still a challenge in NLP. While there are plenty of training data available on the web for German, French, and Italian, there are little data for Romansh. Making sure that the higher-resource languages do not drown out the other languages in the model is non-trivial.

Another consideration is that there are already open-source models for three out of the four languages. Ideally, one could somehow combine these existing resources, adapt them to the peculiarities of Switzerland and add Romansh to the mix. Figuring out how to apply such a “Swiss Finish” has been another challenge.

To tackle these challenges, we used an approach from the recent literature: language-specific model components, or simply: language adapters. An advantage of language adapters is that each language has a reserved module of equal capacity. Each language adapter is activated only if the model processes input in the given language.

Specifically, we based SwissBERT on a massively multilingual model, X-MOD, which has been pre-trained with language adapters from scratch by Pfeiffer et al. (2022):

We trained the existing German, French and Italian adapters as well as a new Romansh adapter on 21 million news articles from Switzerland. Testing out two design variants, we found that a custom vocabulary and custom trained word embeddings (Variant 2 on the right) are better suited for the Swiss national languages than those of the massively multilingual X-MOD.

The 21 million news articles have been retrieved from Swissdox@LiRI, which provides access to many newspapers in the Swiss Media Database (SMD). Thus, rather than crawling the web, we had the chance to use a high-quality and clearly defined corpus for pre-training.

After ten passes through the pre-training corpus, we evaluated SwissBERT on a range of NLP tasks related to Switzerland. We first wanted to see how well it handles the sort of text it has been pre-trained on: contemporary news from Switzerland. We also looked into slightly different text domains to gauge the general performance of the model.

To evaluate our model on named entity recognition (NER), we created a collection of small-scale test sets for the four national languages. Below is an example for the expected output of NER:

We were pleased to see that SwissBERT clearly outperforms the baselines on our test sets, both in terms of supervised NER (German, French and Italian) and zero-shot cross-lingual transfer (Romansh):

Another nice result is that SwissBERT performs unsupervised alignment of Romansh text to German text (Dolev, 2022) much more accurately than previous models, which have not been trained on Romansh: In other words, SwissBERT is good at comparing Romansh sentences to German sentences and identifying similar and dissimilar words and phrases. The same probably holds for other language combinations.

We also evaluated on cross-lingual classification of political comments (which works well) as well as NER for historical newspapers (which does not work too well with SwissBERT). The complete results are documented in our paper pre-print.

Given the encouraging results, we hope that our model can support researchers who need to analyze large amounts of contemporary written text in one of the Swiss national languages. Due to the nature of the pre-training corpus, we release the model with the CC BY-NC 4.0 license for now. This means that it can immediately be used by academic researchers, but not (yet?) for commercial applications.

The work presented in this post is joint work with Johannes Graën of UZH's Linguistic Research Infrastructure (LiRI) and my PhD supervisor Rico Sennrich. It was funded by the Swiss National Science Foundation (project MUTAMUR; no. 176727).

References

Eyal Dolev. Using multilingual word embeddings for similarity-based word alignments in a zero-shot setting. tested on the case of German–Romansh. Master's thesis, Department of Computational Linguistics, University of Zurich, 2022. ↩

Translation Puzzles are In‑context Learning Tasks

2022-12-05T00:00:00+01:00

A research preview of OpenAI's ChatGPT has received a lot of attention. The positive public reaction seems well-deserved, given that the system is not only a state-of-the-art language model, but has also been carefully fine-tuned based on human feedback. As a result, ChatGPT's answers seem a bit more useful than the output of previous language models, even though the system has clear limitations.

When I tested ChatGPT myself last week, one of the things I tried was difficult translation puzzles. Here is an example of such a puzzle, taken from a paper by Şahin et al. (2020):

The puzzle has originally been created by Tom Payne for a Linguistic Olympiad, where students from around the world compete on linguistic tasks. Translation puzzles are very challenging for common natural language processing algorithms. Şahin et al. (2020) have demonstrated this in their PuzzLing Machines benchmark, where the highest accuracy reached by any algorithm has been 3.2%.

The reason why the algorithms cannot solve the puzzles is that they are not really designed for this task. Machine learning does not favor tasks where there are very few examples but each example requires intensive analysis. Neural networks in particular are usually trained on tens of millions of example sentences, rather than just seven sentences. A friend of mine, Antonio Bikić, has compared this phenomenon to the "Dutch Disease" in economics: The abundance of data in natural language processing has led researchers to neglect methods that could use data in an intensive, rather than extensive, manner.

Now let's see how ChatGPT handles the puzzle. The top part is my question and the bottom part is ChatGPT's reply:

ChatGPT has translated the first two sentences correctly. The third translation is not quite correct; according to the reference translation it should be "Holloli" instead of "Hollo." Nevertheless, this looks like a great start.

Testing on the full benchmark

ChatGPT might have seen this particular puzzle during training since it is included as an example in the benchmark paper. So I've asked ChatGPT to also solve the 142 puzzles from the PuzzlingMachines test set, for which there are no solutions on the web. Some of these require ChatGPT to translate from English to another language, and some are variants where translations into English need to be created. Here I report the average score for the two directions.

In terms of ChrF, which is a metric that calculates the character overlap to the reference translation, the results are as follows:

While baselines such as Phrase-based Statistical MT achieve up to 40.2%, ChatGPT reaches 65.9%.

The other metric I've looked at is the ratio of exact matches. This metric is lower than ChrF because partially correct translations do not receive any credit here. As a result, previous baselines have achieved only little more than zero accuracy:

ChatGPT reaches 23.9% exact-match accuracy. Most of its answers are still wrong, but it performs much better than the previous baselines.

Reasons for the high accuracy

Why does ChatGPT do so much better when solving translation puzzles? It seems to me that the way ChatGPT works is a very good fit for these translation puzzles.

First of all, ChatGPT avoids repetition. Repetition is a notorious problem in text generation, and so the developers have probably put in some guardrails against this behavior. Avoidance of repetition is also a property of translation puzzles. Each test sentence in a translation puzzle somehow reuses the vocabulary from the examples in a previously unseen way. As long as ChatGPT tries to do something with the input while not repeating it verbatim, it will likely get some answers right.

Another consideration is that ChatGPT has probably seen many examples of the non-English languages during training. Most of the languages in the translation puzzles have few speakers, such as Chickasaw from the initial example, which has 75 speakers according to Wikipedia. But a few puzzles also involve languages with many speakers, such as Polish or Greek. This might allow ChatGPT to translate some test sentences without even looking at the example sentences.

However, the most important advantage of ChatGPT is probably an idea called in-context learning. The previous approaches have divided the puzzle into two phases: First, a statistical model is trained on the example sentences. Then, that model makes a prediction for each test sentence.

In contrast, ChatGPT can process the puzzle as a whole. All the example sentences and all the test sentences are part of the context provided to the model. When it generates an answer, the language model can attend to all the relevant parts of the context simultaneously.

It is a matter of debate whether this attention to the context can be seen as a form of learning. Irrespective of what it is called, the phenomenon has inspired interesting approaches to dealing with little training data, especially if similar patterns have already been in the training data of the language model (Brown et al., 2020).

Unfortunately, it will be difficult to analyze these individual factors as long as the training details of ChatGPT, as well as the model weights and code, are not public. I hope that other institutions will be able to replicate ChatGPT in an open form, as has happened with other breakthroughs in the past.

Other observations

In one case, the system declined to give a solution and replied:

It is not possible to provide accurate translations for the test sentences without more information about the language they are written in and the context in which they are used. The sentences provided are not in a recognizable language and do not follow any discernible grammar or syntax rules, so it is impossible to determine their meaning or provide translations for them.

I clicked the "Try again" button and used the answers from the second attempt.

In another instance, ChatGPT offered a solution but added the following note:

Note: These translations are based on the example sentences provided and may not be correct. It is always best to consult a qualified translator or language expert for accurate translations.

On the one hand, this kind of hedging is typical for ChatGPT, and will certainly increase the safety of the system. On the other hand, it is not quite clear why it would happily have a go at a puzzle in 99% percent of the cases, and defer to a linguist in the other 1%.

Finally, an important difference in the way the puzzles are used at a Linguistics Olympiad is that the participants are expected to provide an explanation for their solution. As Şahin et al. (2020) mention, the students will still get some points if their approach is valid. When I asked ChatGPT to explain its solution, it told me it was "happy to explain" and wrote five paragraphs with a lot of linguistic jargon. Needless to say that the explanation was all wrong.

References

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, 1877–1901. Curran Associates, Inc., 2020. URL: https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf. ↩

Gözde Gül Şahin, Yova Kementchedjhieva, Phillip Rust, and Iryna Gurevych. PuzzLing Machines: A Challenge on Learning From Small Data. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 1241–1254. Online, July 2020. Association for Computational Linguistics. URL: https://aclanthology.org/2020.acl-main.115, doi:10.18653/v1/2020.acl-main.115. ↩ ¹ ² ³

Three Diffusion Digressions

2022-09-18T00:00:00+02:00

It was a nice coincidence that the Stable Diffusion model was released right before my vacation. While DALL·E 2 had popularized image generation with diffusion before, few people have access to it. In contrast, Stable Diffusion is open source, and hundreds of developers and artists all over the world have started to experiment with it and have been sharing their creations online. Since I now had a week of free time at hand, I decided to explore the model on my own. In this blogpost, I document the three mini-inventions that I came up with.

Hypermosaics

photo of a turtle all the way down #stablediffusion pic.twitter.com/JlHyHTscxq
— Jannis Vamvas (@j_vamvas) September 11, 2022

I think many people have heard of photomosaics, which go back to the nineties. People who have tried creating a photomosaic will also know how difficult it is. The goal of a photomosaic is to compose a primary image out of a large number of component images. Because all these images are pre-defined – e.g., they are photos made by human photographers – the process is computationally complex and usually requires a few tricks, such as manipulating the color of the component images.

In contrast, creating a photomosaic using images generated by Stable Diffusion is straightforward. Let's start with creating the primary image. In this example, I am using the prompt “photo of a turtle” throughout.

Given a primary image, we can then generate the component images top-down. Let's divide the image into tiles (I am using a 64×64 grid).

We can then upsample each tile to 512×512px and use it as an input for generating another photo of a turtle. As you can see in the example above, this process (called img2img) preserves the color gradient in the original tile, which is important for rendering the details of the primary turtle.

As a result, a mosaic generated with Stable Diffusion can be much more detailed than a traditional photomosaic.

You could argue that a mosaic made out of generated images is somewhat pointless, and I agree. The reason image generators are interesting is that it is now possible to create what I call a hypermosaic, which would be very difficult without an image generator.

A hypermosaic is an image that is composed of other images, which in turn are composed of other images, and so forth – a hypermosaic has infinite resolution! To stick to the turtle example: A hypermosaic is “photo of a turtle” all the way down.

With some manual stitching I was able to create a proof of concept within hours based purely on generated images of turtles. The looped video in the tweet above is the outcome (here is a 10MB GIF). Let me know in the comments what you think. My personal opinion: It could make a nice screensaver!

Prompt Ensembling

Not sure if people have already been doing that, but I find it amazingly easy to ensemble #stablediffusion conditioned on two different prompts.

Can you guess the two prompts behind these photos? pic.twitter.com/pkyuwzRPce
— Jannis Vamvas (@j_vamvas) September 13, 2022

When people explore the capabilities of image generators, they often try to combine concepts that are rarely seen together (chair+avocado or astronaut+horse). However, Stable Diffusion cannot combine all concepts out of the box. For example, this is what you get for the prompt “photo of a giraffe that looks like a frog”:

Not too impressive, right? You get something similar if you put “photo of a mixture of giraffe and frog”:

Thus, mixing giraffes and frogs calls for a more hands-on approach. A concept that is well-known in machine translation is to combine multiple models at inference time (ensembling). When working with a single model, one can also combine multiple instances of the same model that are each provided with a different input. Here this would mean to combine an instance of the image generator that is conditioned on ”photo of a giraffe“ with one that is conditioned on ”photo of a frog“.

This approach can also be understood as an interpolation of two conditional models, but to avoid confusion with prompt interpolation I will use the term ensembling.

Let's look at a nice schema of Stable Diffusion created by Hugging Face:

Image source

According to the schema, conditioned latents are iteratively created by conditioning a UNet on the embedded user prompt and a previous latent. A straightforward way to ensemble two prompts would then be to average the two conditioned latents at each step before feeding them back into the UNets:

Here is an implementation of this idea that I made using the Hugging Face diffusers library. I just needed to add 18 lines.

If we now ensemble the prompts “photo of a giraffe” and “photo of a frog” (with default settings and for 80 steps), this is what we get:

The ensembling works in the sense that both prompts are represented in the images. However, the model has a hard time combining them in a meaningful or creative way. In the upper right image, we just see a frog next to a tiny giraffe (which could be seen as a local minimum of what we want to achieve). So it seems we need to help the model further to converge to a unified object.

We can do this by ensembling the prompts “photo of a giraffe that looks like a frog” and “photo of a frog that looks like a giraffe” (which we had used individually in the first example). Now finally we are receiving acceptable images, some of which could even be considered useful and interesting:

Picture Frame Inpainting

To conclude this post, I would like to share my attempt to put Stable Diffusion into real-world use.

An idea that has achieved quite a lot of attention is inpainting, i.e., completing a region inside the image by using the remainder of the image as context. For example, there is a popular web UI for Stable Diffusion that supports inpainting, and there are plugins for various graphics editors.

I decided to try out inpainting by inpainting the inside of actual picture frames:

For people looking to experiment with #stablediffusion, I can truly recommend the Krita plugin by @NicolayMausz

Below: Inpainting picture frames of various styles. Prompt: painting of a dog pic.twitter.com/9OFQ48hzrN
— Jannis Vamvas (@j_vamvas) September 7, 2022

My hope was that the image generator would ”attend“ to the surroundings of the picture frame when inpainting the image into the frame. It is difficult to estimate to what degree this actually happens, and my experiments were not always successful. However, in the (cherry-picked) images above, there are some indications that the image generator considers the context sometimes. For example, on the upper right the dog's colors match the photo from the IKEA product catalog (original image). And on the lower left, the dog's hair might reflect the wooden floor of the apartment.

As a final digression I decided to make a step into the physical world and to let Stable Diffusion inpaint an image into my own apartment. There was a picture frame that had long been empty:

Using the prompt “drawing of an animal”, I then inpainted the frame virtually.

After a few tries I got an acceptable result. I was not completely satisfied with the colors, because many of the inpaintings were gray or had a very faint green. I suspected this was caused by the white bookshelf and the wall, but was not able to get to the root of this phenomenon.

In order to have an actual art print, I had to upscale the inpainted image.

I scaled and padded the image to 512×512 and used the img2img function of Stable Diffusion to create a full-resolution image. For that I used the prompt “impressionist painting of a small deer hiding inside vegetation, high-resolution art print”.

In a second step, I upscaled the image with ESRGAN to get to about 240 DPI. The print is now standing on my bookshelf and, I hope, will decorate the apartment for many years to come.

Lost and Found in Translation

2022-05-25T00:00:00+02:00

This blog post is a brief introduction to a paper presented at ACL 2022, titled "As Little as Possible, as Much as Necessary: Detecting Over- and Undertranslations with Contrastive Conditioning".

Coverage errors in MT

Neural machine translation (NMT) has greatly improved over the last years, but there are still a few typical errors that afflict NMT even in high-resource language pairs. One common error type is addition or omission of content, which is also sometimes called overtranslation or undertranslation, or simply an error of incomplete coverage.

For example, a commercial MT system has recently translated the following English sentence into German (source):

German speakers can confirm that the output sounds fluent. However, the phrase “reeling from low oil prices” has not been translated, and so a crucial piece of information is missing in the translation.

Contrastive Conditioning

In a short paper presented at ACL 2022, we propose a new method for automatically identifying such coverage errors.

The approach that we use is contrastive conditioning, and I have written about it in a previous blog post. We had developed it originally to detect word sense disambiguation errors.

Our idea was to score a translation conditioned on contrastive source sequences. If the translation looks probable for a given source sequence, the latter can tell us something about the translation:

In other words, we try to infer properties of a translation by trying different hypothetical source sequences and checking which ones are most plausible for the translation. This can be done automatically, using an off-the-shelf NMT system.

Minimal Example

In this paper, we use contrastive conditioning to detect coverage errors. Let me demonstrate this on the example of a short, made-up translation:

This translation contains an omission error: The phrase “after landing” seems to have been lost in translation.

Our idea is that this could be detected by conditioning the translation on another, hypothetical source sequence. Specifically, we expect that the source sequence “Please exit the plane” should have a higher likelihood than the actual one.

So let’s take an open-source NMT model and verify:

Indeed, the NMT model assigns a higher probability to the hypothetical source sequence that does not contain “after landing”.

Note that we could use any NMT model for this. It could be the same system that created the translation, but does not have to be.

Searching for Errors

After this proof of concept, let's try if we can spot other omission errors with this method – beyond this made-up example.

When analyzing a translation, we perform an exhaustive search over all the phrases that might be missing in the translation, and we compare the translation probability conditioned on a partial source sequence (which the phrase removed) to the probability conditioned on the original source sequence:

I have grayed out partial sources that can be skipped, since these parts of speech are unlikely to give useful results.

(For example, the article “the” is not a content word, and it is unlikely that there is a coverage error involving just an article. After all, we are mainly interested in so-called constituents, which are sometimes defined as word spans that can be removed from a sentence without rendering it ungrammatical. We approximate the concept of constituents by creating a dependency tree and only selecting nodes that meet certain conditions.)

Checking for addition errors can be done in an analogous way:

We estimate the reverse probabilities using an NMT model that translates in the reverse direction. In the above example, no partial source has a higher likelihood that the original source, since the translation does not contain an addition error.

Real-world Example

Finally, let's apply the algorithm to the real-world example I mentioned at the beginning of this post, where “reeling from low oil prices” was missing in a lengthy sentence. I'm going to use our Python implementation:

from coverage.evaluator import CoverageEvaluator
from translation_models import load_forward_and_backward_model

forward_model, backward_model = load_forward_and_backward_model("mbart50", src_lang="en", tgt_lang="de")

evaluator = CoverageEvaluator(
  src_lang="en",
  tgt_lang="de",
  forward_evaluator=forward_model,
  backward_evaluator=backward_model,
)

src = "The government, reeling from low oil prices, says it hopes tourism will contribute up to 10 percent of the gross domestic product by 2030, compared to three percent currently."
translation = "Die Regierung hofft, dass der Tourismus bis 2030 bis zu 10 Prozent des Bruttoinlandsprodukts ausmachen wird, verglichen mit derzeit drei Prozent."

result = evaluator.detect_errors(src, translation)
print(result)

# Omission errors: reeling from low oil prices | from low oil prices | low | oil

Looking at the output, it seems that the missing phrase is correctly identified.

Evaluation Results

In the paper, we describe how we evaluated our approach on a dataset of real-world machine translation errors created by Freitag et al. (2021). We also perform a human evaluation of word-level precision, in order to better understand what our algorithm gets right and when it fails. The language pairs we evaluate on are English–German and Chinese–English.

As a supervised baseline we use a token classification system (based on XLM-Roberta) that outputs whether a source token is omitted in the translation, and whether a target token is an addition error. This approach is based on previous work on token-level quality estimation and was implemented with OpenKiwi (Kepler et al., 2019). We trained the supervised baseline on a large-scale dataset of synthetic coverage errors. In the paper we describe in more detail how we created this dataset, and we release it alongside the code on GitHub.

On the segment level we find that our algorithm has similar or higher accuracy than the supervised baselines. It is especially accurate for omission errors:

Regarding addition errors, the accuracy of both methods is likely too small to be helpful. But there are fewer positive examples of addition errors in the dataset, which makes it difficult to achieve a high accuracy.

The human evaluation shows us that the word-level precision is comparable to the segment-level accuracy. And it turns out a portion of the detected word spans are actually different types of errors:

The light blue area in the figure means that a detected word span is indeed a translation error, but it is e.g. an accuracy error and not a coverage error. There is also a large amount of false positives, where our human annotators did not find anything wrong with the highlighted word spans, especially with addition errors. In our paper, we show examples for these phenomena.

Summary

We have demonstrated a reference-free method to automatically detect coverage errors in translations. Specifically, our method relies on hypothetical reasoning using contrastive conditioning.

An advantage of our approach is that it does not require a specifically trained model, such as a quality estimation model. Instead we use an off-the-shelf NMT model, which could also be the model that originated the translation in the first place.

Given the encouraging accuracy on omission errors, it would be interesting to see user studies on whether their automatic detection could aid translators and post-editors. On the other hand the detection of addition errors seems to be more challenging, and is still an open problem.

The work presented in this post is joint work with my PhD supervisor Rico Sennrich. It was funded by the Swiss National Science Foundation (project MUTAMUR; no. 176727).

References

Markus Freitag, George Foster, David Grangier, Viresh Ratnakar, Qijun Tan, and Wolfgang Macherey. Experts, Errors, and Context: A Large-Scale Study of Human Evaluation for Machine Translation. Transactions of the Association for Computational Linguistics, 9:1460–1474, 12 2021. URL: https://doi.org/10.1162/tacl\_a\_00437, arXiv:https://direct.mit.edu/tacl/article-pdf/doi/10.1162/tacl\_a\_00437/1979261/tacl\_a\_00437.pdf. ↩

Fabio Kepler, Jonay Trénous, Marcos Treviso, Miguel Vera, and André F. T. Martins. OpenKiwi: an open source framework for quality estimation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, 117–122. Florence, Italy, July 2019. Association for Computational Linguistics. URL: https://aclanthology.org/P19-3020, doi:10.18653/v1/P19-3020. ↩

NMTScore: Text Similarity via Translation

2022-04-29T00:00:00+02:00

While neural machine translation (NMT) is mainly being used for translating text, it is also useful for comparing text. This bonus feature of NMT is promising for some areas where attention to detail matters. We have released a new Python library as well as a paper accepted to Findings of EMNLP 2022 that compares translation-based similarity measures to baselines such as sentence embeddings.

In this post I summarize our findings.

The concept of translation probability

At the core of an NMT system, there is a translation model that estimates the probability of any translation, given the source sequence. For example, a good translation model will tell us that "Bonjour" can probably be translated into English as "Hello", but that "Goodbye" would be an improbable translation.

Let me demonstrate this using our Python library, NMTScore. First, let's download an open-source NMT model from HuggingFace:

from nmtscore.models import load_translation_model
model = load_translation_model("m2m100_418M")

This model, called M2M100, has been released by Fan et al. (2021). Let's ask the model to estimate some translation probabilities for us:

model.score("en", ["Bonjour !"], ["Hello to you!"])
# [0.35]
model.score("en", ["Bonjour !"], ["Sleep well!"])
# [0.11]

The first argument tells the model about the target language. Of course, if the target language were German instead of English, the translation would become less probable:

model.score("de", ["Bonjour !"], ["Hello to you!"])
# [0.04]

Multilingual translation models

Which brings us to the concept of a multilingual translation model. Multilingual models are trained jointly on multiple source languages and/or target languages. For example, M2M100 is a many-to-many model that translates between no less than 100 languages.

As shown above, a multilingual model only needs to know the target language, but it can infer the source language by itself. For example, we can create a translation from Swedish without explicitly telling the model that the input is Swedish:

model.translate("en", ["Hej Hanna, hur är läget?"])
# ['Hi Hanna, how is it?']

This is an interesting property, and a side effect is that the input language is allowed to be identical to the target language:

model.translate("en", ["Hi Hanna, how are you?"])
# ['Hi Hanna, how are you?']

The above is sometimes called zero-shot paraphrasing (Thompson and Post, 2020).

Different ways to compare two sentences

The basic principle behind NMTScore is that translation probabilities can be used to compare two sentences. For example, "Hello", "Good day", "Bonjour, and "Hej" all have similar meaning. On the other hand, "Hello", "Sleep well" and "Schadenfreude" are not similar with respect to their meaning.

In the past, researchers have already come up with creative ways of leveraging NMT to compare a sentence A to a sentence B:

For example, the translation probability of A given B can be used directly as a similarity measure (left-hand side; JunczysDowmunt (2018), Thompson and Post (2020)). Alternatively one can also estimate the probability that A is a translation of B via a pivot language, and use that as a similarity measure (right-hand side; Mallinson et al. (2017)).

Both approaches are useful if A and B are in two different languages (upper row), and also if A and B are in the same language (bottom row). The reason for that is that multilingual NMT models do not need to know the language of their input.

When creating the matrix above, we found that an interesting variant has not been tried before, and we added that column to the matrix:

The idea of translation cross-likelihood is that both A and B are translated into a target language (e.g. English). Specifically, we ask the model whether a translation of B could also be a good translation of A.

Again, this approach works both for sentences in the same language, and cross-lingually. The translation cross-likelihood measure has some other nice properties. For example, it is somewhat more symmetrical.

Advantages of translation-based text similarity

In our paper, we evaluate the different variants of NMTScore in two settings: multilingual paraphrase identification, and multilingual reference-based evaluation of generated text.

The goal of the former is to find out whether two sentences are paraphrases of each other. A similarity measure has high accuracy if it assigns higher similarity to the paraphrases in a dataset than to the non-paraphrases.

Overall, we found that NMTScore is competitive compared to common baselines such as embeddings derived from a pre-trained language model. We also propose a normalization scheme, called reconstruction normalization, and we show that it contributes to the high accuracy of NMTScore.

NMTScore is especially good with adversarial examples, where deceptively similar sentences pairs need to be distinguished (Zhang et al., 2019):

Here's the code to reproduce the figure:

from nmtscore import NMTScorer
scorer = NMTScorer("prism")

scorer.score(
    "Flights from New York to Florida.",
    "Flights from Florida to New York.",
)
# 0.67
scorer.score(
    "Flights from New York to Florida.",
    "Flights to Florida from New York.",
)
# 0.71

In the second evaluation setting – reference-based evaluation – we find a competitive performance to the baselines. This is especially relevant to NLP researchers who evaluate and compare text generation systems such as data-to-text systems. The researchers often use similarity measures (like BLEU) to compare the system output to a reference output, and NMTScore seems to be a relatively reliable choice for such a measure.

Outlook

Our library is available on GitHub. If you'd like to share a use case or a suggestion, please create an issue to let us know.

In summary, NMTScore and its variants are an attractive complement to other similarity measures. Keep in mind, however, that the open-source NMT models we use perform especially well with shorter text segments, and do not support all language pairs equally well.

The work presented in this post is joint work with my PhD supervisor Rico Sennrich. It was funded by the Swiss National Science Foundation (project MUTAMUR; no. 176727).

References

Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, Naman Goyal, Tom Birch, Vitaliy Liptchinsky, Sergey Edunov, Michael Auli, and Armand Joulin. Beyond english-centric multilingual machine translation. Journal of Machine Learning Research, 22(107):1–48, 2021. URL: http://jmlr.org/papers/v22/20-1307.html. ↩

Marcin Junczys-Dowmunt. Dual conditional cross-entropy filtering of noisy parallel corpora. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, 888–895. Belgium, Brussels, October 2018. Association for Computational Linguistics. URL: https://aclanthology.org/W18-6478, doi:10.18653/v1/W18-6478. ↩

Jonathan Mallinson, Rico Sennrich, and Mirella Lapata. Paraphrasing revisited with neural machine translation. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, 881–893. Valencia, Spain, April 2017. Association for Computational Linguistics. URL: https://aclanthology.org/E17-1083. ↩

Yuan Zhang, Jason Baldridge, and Luheng He. PAWS: paraphrase adversaries from word scrambling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 1298–1308. Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. URL: https://aclanthology.org/N19-1131, doi:10.18653/v1/N19-1131. ↩

The Limits of Minimal Sentence Pairs

2021-10-12T00:00:00+02:00

In a paper presented at BlackboxNLP 2021, we highlight a limitation of minimal sentence pairs when it comes to predicting generative behavior, and propose a simple technique for improving their predictiveness. This blog post is a brief introduction.

Why minimal sentence pairs are useful

Minimal sentence pairs (Linzen et al., 2016) are frequently used for the contrastive evaluation of language generation models:

Sentences A and B differ in just one aspect. In Sentence A, there is agreement between the subject and the verb, whereas in Sentence B, the number of the verb disagrees with the subject. If a language model assigns a higher probability score to Sentence A and similar sentences than to Sentence B, this can be seen as a preference for subject-verb agreement.

A great advantage of minimal sentence pairs is that they allow to automate the evaluation of language models while isolating a specific linguistic phenomenon. But a question that has also been raised in previous work by Newman et al. (2021) is how much minimal pairs tell us about the generative behavior of a model.

Contrastive evaluation is based on a forced decision between two predefined sequences. However, at deployment time, end users are often exposed to the 1-best generated sequence, for example in machine translation or in dialogue. The sequence that end users are seeing might be completely different from the choices given to the model at evaluation time:

Contrastive evaluation in MT

In our paper, we set out to explore the limits of minimal pairs. In order to do that, we perform some experiments in the context of neural machine translation (NMT).

When evaluating NMT models, minimal pairs are used on the target side (Sennrich, 2017). Given a source sequence, two contrastive translation variants are presented to the model:

The example above shows two German translation variants for an English source sentence. Again, translation A preserves subject-verb agreement and translation B does not.

The probability score that is output by an NMT system is usually computed as the average log-likelihood of the target tokens, conditioned on the source sequence. It can be expected that a good NMT system assigns a higher score to the correct translation variant.

But do such minimal pairs always tell us how an NMT system will behave at deployment time?

Testing implausible hypotheses using minimal pairs

A straightforward way to explore the limits of such minimal pairs is to test an implausible hypothesis about the generative behavior of NMT systems.

In previous work, all the hypotheses that have been tested are more or less plausible, as for example the hypothesis that NMT systems observe subject-verb agreement. However, when we test some implausible hypotheses about generative behavior, we find that the results based on minimal pairs still produce some evidence for these implausible hypotheses.

Specifically, we look at two phenomena in the German language that are cognitive or social phenomena. The first implausible hypothesis is that NMT systems use vague language in the form of placeholder nouns very frequently. In spoken German, if someone doesn’t remember a noun, they might say ’Ding’ instead, which means ‘thing’ or ‘thingy’:

The second implausible hypothesis is that NMT systems are frequently producing hypercorrections. We look at hypercorrect genitives in German, where dative, rather than genitive, would be the more acceptable case for a preposition:

But why do we think that these phenomena are implausible? First of all, they are rarely found in the training data, so they would have to originate from somewhere else. In human speech, the phenomena are caused by cognitive and social factors, for example the tendency to forget a word when speaking, or the desire to attain social prestige. These factors do not apply to neural language models, making it implausible that they would produce the phenomena.

When creating test sets of minimal pairs for these phenomena, we find that our NMT systems do not reach 100 percent accuracy:

In other words, there are a certain number of instances where the NMT systems decide in favor of the translation containing vague language, or in favor of the translation containing a hypercorrection.

Taken at face value, these results would indicate that the NMT systems do generate the phenomena occasionally, and also that distilled NMT systems produce the phenomena more often than non-distilled models. But if you look at actual machine translations, you will almost never find the phenomena (since they are very implausible). So in this sense, minimal pairs would lead to false positives about the generative behavior of NMT systems, making comparisons between systems more difficult.

Some minimal pairs are more predictive than others

One reason why minimal pairs are not entirely predictive of generative behavior is that they are not among the translations that the NMT system would generate by itself, given the source sequence:

The 1-best translation has the highest probability score and is usually approximated using beam search in practice. In comparison, the contrastive variants of the minimal pair usually have a lower probability score.

One reason for that may be that they have been constructed by humans, and as such are sampled from a slightly different language distribution than what the system would generate by itself. This discrepancy might be especially large for distilled NMT systems, since they have never been exposed to human-written text during training.

It seems reasonable to assume that a high discrepancy of minimal pairs can hurt their predictiveness. We checked this by creating a second set of minimal pairs from machine-generated references.

We translated the same source sequences we used before with a variety of commercial MT systems, and constructed minimal pairs based on the machine translations:

In the above example, the commercial MT system has output a translation with correct subject-verb agreement (A-MT). Like before, the incorrect variant can be created by changing the number of the verb (B-MT). If you happen to speak German, you will also notice that the machine translation uses slightly simpler expressions than the human translation (A).

When testing our two implausible hypotheses again, we now find that the test sets derived from machine-generated references produce fewer false positives, showing that these test sets are now more predictive of generative behavior:

The DistilLingEval test suite

The success that we had with machine-generated references inspired us to also release similar contrastive test sets for other linguistic phenomena.

DistilLingEval is a test suite for 8 linguistic phenomena in English→German translation. While similar to LingEval97, our test suite has been created based on machine-generated references. As such, we expect it to be more predictive when it comes to generative behavior. Have a look at the GitHub repo to find out more about the test suite.

Our hope is that in future work, similar test sets will be created for other tasks, languages and linguistic phenomena.

The work presented in this post is joint work with my PhD supervisor Rico Sennrich. It was funded by the Swiss National Science Foundation (project MUTAMUR; no. 176727).

References

Tal Linzen, Emmanuel Dupoux, and Yoav Goldberg. Assessing the ability of LSTMs to learn syntax-sensitive dependencies. Transactions of the Association for Computational Linguistics, 4:521–535, 2016. URL: https://aclanthology.org/Q16-1037, doi:10.1162/tacl_a_00115. ↩

Benjamin Newman, Kai-Siang Ang, Julia Gong, and John Hewitt. Refining targeted syntactic evaluation of language models. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 3710–3723. Online, June 2021. Association for Computational Linguistics. URL: https://aclanthology.org/2021.naacl-main.290, doi:10.18653/v1/2021.naacl-main.290. ↩

Rico Sennrich. How grammatical is character-level neural machine translation? Assessing MT quality with contrastive translation pairs. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, 376–382. Valencia, Spain, April 2017. Association for Computational Linguistics. URL: https://aclanthology.org/E17-2060. ↩

When MT Distillation Leads to Bias

2021-08-29T00:00:00+02:00

In a paper presented at EMNLP 2021, we take a closer look at lexical overgeneralization in MT. The first part of the paper introduces a new technique for evaluating disambiguation, called contrastive conditioning (→ blog post). Here's an introduction to the second part of our paper: A case study showing that distilled MT models have a stronger overgeneralization bias.

The Impact of Disambiguation Errors

One of the great challenges of machine translation (MT) is inferring the correct word sense of ambiguous words. There are different ways in which words can be ambiguous – a well-known example are nouns that can mean multiple things.

For instance, the English noun starter can refer to an appetizer or to a motor part. Since German has different words for the two concepts, an MT system needs to decide between Vorspeise and Anlasser when translating starter into German:

Context usually helps with disambiguation. A starter that is made of avocados is probably not a motor part. But MT systems sometimes ignore this context and make disambiguation errors in spite of it.

Another interesting example are gendered occupation names. Stanovsky et al. (2019) have demonstrated how even the best MT systems tend to ignore pronouns when translating occupation names from English. This is a problem because morphologically rich languages such as German have different forms for female and male occupation holders:

What is especially unpleasant about such errors is that they are systematic: MT systems tend to ignore female pronouns more often than male pronouns. This is because the training data contain many more male occupation names, among other factors. Thus, disambiguation errors in MT are mostly caused by an overgeneralization of the training data.

Clearly, overgeneralization hurts the adequacy of machine translations. But when it comes to overgeneralization of gender, many people see an ethical problem as well. For example, if you are concerned about the dominance of male forms in human-written German texts (as many feminist linguists are), then you should also be concerned about an even greater dominance of male forms in the output of MT systems.

Background: Distillation for MT

In our case study we look at a technique called sequence-level knowledge distillation, since there are reasons to believe that it increases overgeneralization. This technique, originally proposed by Kim and Rush (2016), is often used to reduce the size of an MT model, for example to make it fit on your phone.

The idea is to use not one but three steps to train the model:

Train a normal MT model ("teacher").
Re-translate all the training data with the teacher.
Train a smaller MT model ("student") on those data.

A student model trained like this usually reaches a better BLEU score than if it had been trained on the original data. This means that there must be a difference between the original and the distilled data that makes training a model easier. While many researchers have looked for such a difference, this figure derived from a paper by Zhou et al. (2020) is a good visualization:

The distilled translations are more predictable given the source sentences. This means that they follow the structure of the source more closely and contain less human noise.

Our hypothesis was that the same phenomenon also leads student models to commit more disambiguation errors due to overgeneralization.

Distillation leads to Overgeneralization

A very simple test method is to count the words in the different strata of the training data during distillation. For example, to find out how distillation affects translations of doctor, we had a close look at versions of the WMT19 English–German training data:

In the original training data (created by human translators over decades), doctor is mostly translated into male forms such as Arzt and rarely into female forms such as Ärztin. (The center represents word forms that we could not automatically classify as male or female forms based on grammatical gender.) The distilled training data produced by the teacher have an even stronger imbalance, and the student, when we fed it the same sources, created translations that overgeneralize even more.

To see for yourself that this phenomenon is not specific to doctors, have a look at this Sankey diagram, which shows the same trend for 24 different occupations:

Results on Probing Tasks

A limitation of counting words is that there is no guarantee the context contains a clue about the correct translation. Many English source sentences might use the word doctor in an inherently ambiguous sense that would be impossible to disambiguate even for human translators. In that sense, we have so far only observed a weak form of overgeneralization.

To verify that distilled MT systems also have a strong overgeneralization bias (that they also tend to ignore the more salient contexts), we performed experiments using two probing datasets for word sense disambiguation: MuCoW (Raganato et al., 2019) and WinoMT (Stanovsky et al., 2019), and two language pairs (English–German and English–Russian). On top of the two datasets, we used a novel evaluation protocol called contrastive conditioning that allowed us to judge the 1-best translations of our models with a high recall.

Below I have included one of four graphs from the results (the others look similar and can be found in the paper).

Accuracies of various English–German models, distilled and non-distilled, on word sense disambiguation.

Our probing results confirm that distilled MT systems indeed have a higher overgeneralization bias than comparable non-distilled models, even if we control for BLEU. Another – expected – implication of the results is that BLEU scores do not capture overgeneralization bias reliably. It seems that to trace the effects of distillation in MT, targeted evaluation is needed as much as ever, and we hope that contrastive conditioning can contribute to it.

The work presented in this post is joint work with my PhD supervisor Rico Sennrich. It was funded by the Swiss National Science Foundation (project MUTAMUR; no. 176727).

References

Yoon Kim and Alexander M. Rush. Sequence-level knowledge distillation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 1317–1327. Austin, Texas, November 2016. Association for Computational Linguistics. URL: https://www.aclweb.org/anthology/D16-1139, doi:10.18653/v1/D16-1139. ↩

Alessandro Raganato, Yves Scherrer, and Jörg Tiedemann. The MuCoW test suite at WMT 2019: automatically harvested multilingual contrastive word sense disambiguation test sets for machine translation. In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), 470–480. Florence, Italy, August 2019. Association for Computational Linguistics. URL: https://www.aclweb.org/anthology/W19-5354, doi:10.18653/v1/W19-5354. ↩

Gabriel Stanovsky, Noah A. Smith, and Luke Zettlemoyer. Evaluating gender bias in machine translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 1679–1684. Florence, Italy, July 2019. Association for Computational Linguistics. URL: https://www.aclweb.org/anthology/P19-1164, doi:10.18653/v1/P19-1164. ↩ ¹ ²

Chunting Zhou, Jiatao Gu, and Graham Neubig. Understanding knowledge distillation in non-autoregressive machine translation. In International Conference on Learning Representations. 2020. URL: https://openreview.net/forum?id=BygFVAEKDH. ↩

Evaluating Black-Box MT with Contrastive Conditioning

2021-08-28T00:00:00+02:00

In a paper presented at EMNLP 2021, we propose a new technique for evaluating disambiguation in machine translation called contrastive conditioning. This post is an introduction to the core ideas behind the technique.

White-Box vs. Black-Box MT

Machine translation (MT) comes with different user interfaces. Systems created in a research lab are white-box systems, which means that the code and the model weights are available to the researchers. On the other hand, commercial MT systems such as Google Translate are black boxes to most users. Independent researchers can only observe what goes in (the source sentence) and what comes out (the machine translation).

As MT systems are getting better and better, targeted evaluation has become more important. MT researchers analyze the performance of systems regarding specific linguistic phenomena, using carefully crafted test data. Scaling such a targeted evaluation is easier if you can peek into the model internals.

Contrastive Evaluation

A perfect example is the contrastive evaluation technique (Sennrich, 2017). There, the idea is to suggest two different translation variants to the model. The first variant is a correct translation and the second one contains the error type you're interested in:

Example for contrastive evaluation of an MT system (English–German). In this post, I am using the error type "wrong disambiguation of Turkey" as an example, since when translating from English into German, you need to choose between Türkei (the country) and Truthahn (the bird). This is just an illustrating example, not an error that a modern MT system is likely to make.

One can expect that a good model assigns a higher score to the correct translation, like in the example above. However, such scores (probability estimates of a translation given a source sentence) can only be computed with white-box systems. You cannot apply contrastive evaluation to Google Translate and similar services.

Pattern-Matching Evaluation

To also subject black-box MT to targeted evaluation, researchers have come up with pattern-matching approaches that automatically analyze the translation output. For example, Raganato et al. (2020) search the translation for different BabelNet senses, which in our example allows them to check whether Turkey is translated correctly into German:

Example for a pattern-matching evaluation (English–German)

However, there are always going to be a few translations (let's say 20%) that do not match any of the patterns. Natural language has many ways of expressing a concept, and it is difficult to enumerate them all. Another drawback, which also applies to contrastive evaluation, is that you need to prepare data for every target language of interest.

What we find exciting about the new method is that it promises, at least for phenomena such as lexical disambiguation, to combine the best of both worlds: The recall of contrastive evaluation and the black-box accessibility of pattern-matching.

Inspiration for Contrastive Conditioning

The basic idea is that by varying the source sequence to provide more disambiguation cues, you can learn about the ground truth without speaking the target language. For example, let's begin with our original, slightly ambiguous Turkey sentence:

If you replace Turkey with modern Turkey, you give Google Translate an additional cue that Turkey is supposed to be a country, not a bird. Given such a cue, Google Translate is less likely to make a disambiguation error:

Since Google Translate has not stopped using the German word Türkei, the original sentence seems to have been translated correctly. You can verify this by replacing modern Turkey with frozen Turkey:

The word Türkei has now disappeared from the translation, which is further proof that Türkei refers to a country, and not to a bird.

Formalization

Countless power users of MT have probably already come up with this algorithm. The challenge is to apply it to targeted evaluation in a systematic and formalized way.

Our approach is based on scoring, like in contrastive evaluation. However, what we score is not a human-crafted translation, but the translation by the MT system. We compute two scores for the translation: one score based on a correct disambiguation cue, and another score based on an incorrect disambiguation cue. If the first score is higher than the second score, the translation is probably correct:

To make things more clear, let's do a comparison to the standard approach of contrastive evaluation: In the standard approach, two contrastive variants of the translation are scored, conditioned on the same source sequence. In our approach (contrastive conditioning), the same translation is scored twice, conditioned on contrastive variants on the source.

That may not seem like a big difference. But crucially, our method does not depend on the evaluated model to produce the scores. It only requires a translation output, and then any MT system can play the role of an evaluator model. If you are evaluating a black box you need an additional white-box MT system for that (for example from the OPUS-MT project).

Other Benefits

In addition to its black-box nature, contrastive conditioning has a few other interesting properties. For example, since the ratio between the two scores can be seen as the confidence of the evaluator, it is possible to weigh the test samples by this confidence. This is helpful if some test samples are difficult to judge.

Another advantage is that the test data can be reference-free. All the contrastive modifications happen in the source language. In that sense, contrastive conditioning is easy to scale across many target languages.

Outlook

It is an open question how many linguistic phenomena are amenable to contrastive conditioning. In our paper, we perform a successful case study on two well-known challenges for MT: Word sense disambiguation and the translation of gendered occupation names into morphologically rich languages. I am going to discuss our findings in a follow-up post.

It will be interesting to see whether phenomena beyond disambiguation can also be evaluated using contrastive conditioning. And in my view, disambiguation remains a great challenge for MT, even though few modern systems are as bad as the one that produced that infamous clothing label:

🤷‍♂️ German is hard... pic.twitter.com/Qe81wLhAnQ
— Benedikt Meurer (@bmeurer) March 19, 2019

The work presented in this post is joint work with my PhD supervisor Rico Sennrich. It was funded by the Swiss National Science Foundation (project MUTAMUR; no. 176727).

References

Alessandro Raganato, Yves Scherrer, and Jörg Tiedemann. An evaluation benchmark for testing the word sense disambiguation capabilities of machine translation systems. In Proceedings of the 12th Language Resources and Evaluation Conference, 3668–3675. Marseille, France, May 2020. European Language Resources Association. URL: https://www.aclweb.org/anthology/2020.lrec-1.452. ↩

Rico Sennrich. How grammatical is character-level neural machine translation? assessing MT quality with contrastive translation pairs. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, 376–382. Valencia, Spain, April 2017. Association for Computational Linguistics. URL: https://www.aclweb.org/anthology/E17-2060. ↩

More General Stance Detection with x-Stance

2020-06-11T00:00:00+02:00

Update (2020-08-01): A video of the conference presentation is now available.

Automated stance detection systems try to detect broad opinions in natural language expressions. This post is an introduction to a new resource for stance detection called x-Stance.

Background: Stance Detection

The idea behind stance detection is best explained in an example. This tweet has recently earned more than 2 million likes:

After stoking the fires of white supremacy and racism your entire presidency, you have the nerve to feign moral superiority before threatening violence? ‘When the looting starts the shooting starts’??? We will vote you out in November. @realdonaldtrump
— Taylor Swift (@taylorswift13) May 29, 2020

Of course, the tweet does not just constitute a neutral election forecast. It also expresses a negative stance towards Donald Trump.

The type of stance detection system we are interested in is a system that takes an input text and a target and outputs either favor, against or neutral:

Stance detection systems have been made possible by a series of annotated datasets – from a collection of English tweets on Donald Trump and other topics all the way to a collection of Italian tweets on the Italian constitution.

However, it has been unclear if the systems can generalize well beyond those specific settings. For example, an English system is not necessarily applicable to Italian or French tweets. And if a system trained on the target of Donald Trump does not generalize to future presidents, then the data collection effort might not be very sustainable.

The generalization problem in stance detection. The two dimensions of transfer have been studied individually (Mohammad et al. (2016), Taule et al. (2017), Taule et al. (2018)), but not jointly.

In x-Stance: A Multilingual Multi-Target Dataset for Stance Detection, we present a large-scale dataset in 3 languages and on more than 150 political issues. We show that x-Stance can be used to train a single model on all of those issues.

We also look at generalization performance: Our models are evaluated both in a held-out language and on held-out targets. We find that if a standard text classification model is used, zero-shot cross-lingual and cross-target transfer is moderately successful.

How We Created x-Stance

x-Stance contains 67,000 samples, which is an order of magnitude more than what has been common in stance detection. The dataset is relatively large because instead of doing manual annotation, we have extracted the data directly from a political website. The website – smartvote.ch – is a voting advice application that is highly popular in Switzerland.

Electoral candidates who participate in such a voting advice application are asked a range of questions on controversial topics. Should cannabis use be legalized? Should Switzerland strive for a free trade agreement with the United States?

What makes this website so interesting for stance detection is that the candidates can respond in two ways: On a yes/no scale and in a free-text comment:

A part of a candidate's response on smartvote.ch.

While it is not mandatory, candidates often like to write a few sentences in order to justify, explain or differentiate the yes/no answer in their own language. If we now reverse this relation and interpret the yes/no answer as an annotation of the comments, we receive a supervised data set for stance detection:

Such an automatically extracted dataset is likely noisier than manually curated datasets, but we still expect it to be useful for machine learning research. And thanks to the Smartvote team, we can make it available to fellow researchers under the CC BY-NC 4.0 license.

Apart from stance detection as a supervised task, we believe that the x-Stance is also a valuable resource for the study of transfer learning.

Putting the «x» in x-Stance: Transfer Learning

First of all, x-Stance is a multilingual dataset: Since different languages are spoken in different parts of Switzerland, candidates have been free to answer in any language, be it German, French or Italian. (To our regret we did not encounter any Romansh comments).

Secondly, the x-Stance dataset contains questions on diverse policy issues. We have clustered the questions into 12 broad topics, including 2 held-out topics:

We can now test both for cross-lingual transfer from German and French to Italian, and for cross-target transfer from known topics to previously unseen topics such as healthcare:

In our paper we use a standard architecture – BERT – to demonstrate how x-Stance can serve as a benchmark for transfer learning.

Adapting BERT to x-Stance

We download the multilingual BERT model, which has been pre-trained by Google Research in 104 languages. We then fine-tune the model on our German and French training data.

As we want to train the model jointly on multiple targets (Free Trade, Legality of Cannabis, …), we need to inform the model about the specific target of every instance. For this we concatenate to each comment the corresponding natural-language question from Smartvote.

As BERT allows for two input segments, we designate the question as segment A and the comment as segment B. We put a linear classifier on top of BERT and train the model to predict either FAVOR or AGAINST given a question–comment pair:

Input and output of a BERT sequence pair classifier. Original image by Artetxe et al. (2019).

In a supervised setting, which involves previously seen targets and languages, we find that BERT can clearly surpass a simple majority-class baseline:

In the cross-lingual setting, the zero-shot performance in Italian is much better than the baseline. But the performance is higher in German and French, because the model has been trained on samples in those languages.

Finally, the model can also generalize to held-out topics. If the model is asked to detect the stance of a text towards a previously unseen target, it performs better than a global majority-class baseline:

Bringing it All Together

Given the x-Stance dataset for stance detection, a multilingual BERT model has some capability to perform zero-shot transfer to unseen languages and to unseen targets. However, there is a gap in performance between the supervised settings and the zero-shot settings that future work could address. For example, even better representations or a more sophisticated classification architecture could be used.

Learn More about x-Stance

Our GitHub repository contains the full dataset as well as the evaluation script and the code for our baselines.
The x-Stance paper has been presented at the 5th SwissText & 16th KONVENS Conference 2020. In the paper you will find more details, and also references to related work, which have mostly been omitted in this blog post.
x-Stance is also part of the datasets library from Huggingface. Use the live viewer to have an interactive look at the dataset.

The work presented in this post is joint work with my PhD supervisor Rico Sennrich. It was funded by the Swiss National Science Foundation (project MUTAMUR; no. 176727).

References

Mikel Artetxe, Sebastian Ruder, and Dani Yogatama. On the cross-lingual transferability of monolingual representations. arXiv preprint arXiv:1910.11856, 2019. ↩

Saif Mohammad, Svetlana Kiritchenko, Parinaz Sobhani, Xiaodan Zhu, and Colin Cherry. SemEval-2016 task 6: detecting stance in tweets. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), 31–41. San Diego, California, June 2016. Association for Computational Linguistics. URL: https://www.aclweb.org/anthology/S16-1003, doi:10.18653/v1/S16-1003. ↩

Mariona Taulé, M Antònia Martí, Francisco Rangel, Paolo Rosso, Cristina Bosco, and Viviana Patti. Overview of the task on stance and gender detection in tweets on catalan independence at ibereval 2017. In 2nd Workshop on Evaluation of Human Language Technologies for Iberian Languages, IberEval 2017, volume 1881, 157–177. 2017. URL: http://ceur-ws.org/Vol-1881/Overview5.pdf. ↩

Mariona Taulé, Francisco Rangel, M Antònia Martí, and Paolo Rosso. Overview of the task on multimodal stance detection in tweets on catalan #1oct referendum. In 3rd Workshop on Evaluation of Human Language Technologies for Iberian Languages, IberEval 2018, volume 2150, 149–166. 2018. URL: http://ceur-ws.org/Vol-2150/overview-Multistance18.pdf. ↩

BERT for NER

2019-06-19T00:00:00+02:00

BERT models, when fine-tuned on Named Entity Recognition (NER), can have a very competitive performance for the English language. This is an overview of how BERT is designed and how it can be applied to the task of NER. In the last section, I will discuss a cross-lingual scenario.

In this post, I will assume a basic familiarity with the NER task. When I talk about implementation details of BERT (Devlin et al., 2019), I am referring to the PyTorch version that was open-sourced by Hugging Face. I have not checked if it completely matches the original implementation with respect to those details.

First let us look at what goes into the BERT model, because it is rather important for getting NER right.

Preprocessing

The input to BERT is preprocessed using WordPiece tokenization (Johnson et al., 2017), which is a technique comparable to Byte Pair Encoding (Sennrich et al., 2016). The vocabulary is trained on the pre-training data, then re-used for the fine-tuning without any modifications. For the English language model, a vocabulary of 30k tokens is used, and for the multilingual model, 110k tokens.

It is important to note that NER datasets like CoNLL-2003 are already tokenized, as the gold annotation is provided token by token. For training and evaluating BERT, the WordPiece tokenizer strictly deepens the preexisting tokenization. For every original token, the WordPiece tokenization can take one of two forms:

One-to-one tokenization. The token is in the vocabulary. In this case, it is represented by a single WordPiece.

One-to-many tokenization. The token is not in the vocabulary; in this case, the WordPiece tokenizer will split the token into a sequence of vocabulary items using a greedy longest-match approach. I call the first WordPiece of such a split the head and the other WordPieces the tails.

Tails are marked with a leading “##” by the WordPiece tokenizer. It follows that the vocabulary is divided into two distinct sets: WordPieces that can occur both in a single or in a head role on the one hand (no leading “##”), and tail WordPieces on the other hand (with a leading “##”).

Example sentence illustrating one-to-one tokenization and one-to-many tokenization. In this example, “Leicester” is what I call a "head" WordPiece, and “##shire” is a "tail" WordPiece..

Examples for vocabulary entries of bert-large-cased that can occur as heads or single token (left) and entries that can only occur as tails (right).

WordPiece embeddings are only one part of the input to BERT. The full input is a sum of three kinds of embeddings, each with a size of 768 for BERT-Base (or 1024 for BERT-Large):

WordPiece embeddings, which like the other embeddings are trained from scratch and stay trainable during the fine-tuning step.

Segment embeddings. These kinds of segments are only relevant for tasks where a pair of sentences is classified. For NER, the embedding is insignificant, and the same segment (“A”) can be used for all tokens.

Position embeddings, which compensate for the non-recurrence of the transformer layers.

During pre-training, the input has a maximum length of 512 WordPieces. When BERT is finetuned on NER, using such long sequences is unnecessary, given that NER is usually done sentence by sentence. A sequence length of 150 should be enough, as the sentences in the English CoNLL-2003 validation set have 14.5 (original) tokens on average, and the longest sentence has 109 tokens. An alternative to having long enough sequences is a sliding window approach as described by Wu and Dredze (2019).

BERT also expects that each sentence starts with a [CLS] token and ends with a [SEP] token. These special tokens are not particularly relevant for the NER task, considering that classification is done token-wise and the special tokens have no associated tag. Nevertheless, they should be included so that the fine-tuning input is not too different from the pre-training input.

For most of the tasks BERT was evaluated on, a model with a lowercase vocabulary was used. For NER, however, the cased variant should be used and no lowercasing should be performed during preprocessing.

Architecture

BERT is a stack of Transformer layers (Vaswani et al., 2017). The variant of size “Base” has 12 layers, 12 self-attention heads, and a hidden state size of 768 per token. In total, BERT-Base has 110 million trainable parameters. The BERT-Large variant has 24 layers, 16 self-attention heads and a hidden size of 1024, which amounts to 340 million parameters.

The Transformer implements some innovative ideas which are highly relevant for the NER task:

Self-attention. Self-attention here is the idea to encode a token as the weighted sum of its context. The weights are computed as a function of the context token (key) and the token to be encoded (query), and this function has trainable weights. Other than RNN layers, which are also designed to represent tokens in context, the self-attention layer handles the context as a bag of words. This design decision is crucial:On the one hand, the parallelizability of the bag-of-words pattern makes it possible to efficiently pre-train BERT on very large corpora. On the other hand, steps need to be taken to incorporate word order in another way (see Soft Sequentiality below).

Multi-head attention. When a token is encoded as a weighted sum of its context, this is a rather coarse representation. For this reason, the designers of the Transformer introduced Multi-head attention to increase the representational capacity of the model. Multi-head attention means that the self-attention step is performed multiple times in parallel, with different weight matrices. The output of the attention heads is then concatenated and projected to the size of a single attention head.In theory, this would allow a NER model to learn different attention heads for different classes. In reality, however, the classes are not separated that way, as preliminary investigations have shown.

Stacking. As is often done with RNNs, too, multiple self-attention layers are stacked. Interspersed with the self-attention layers are token-wise feed-forward layers, and all layers are connected via skip-connections, too. But interpreting differences in the various trained layers has turned out to be difficult. With respect to NER, Wu and Dredze (2019) have shown that BERT layers to not consistently become more language-independent towards the final layer. Still, the layeredness is hoped to improve the abstractive capacity of the model, which would help to solve a task such as NER requiring complex semantics.

Soft sequentiality. This is a concept I like to use for how a transformer incorporates the sequentiality of the tokens. In an RNN, the sequence of tokens is hardwired through recurrence. On the other hand, in a transformer, the sequence is a feature of the tokens, which “softens” the sequentiality. The feature is chosen such that the model can generalize well to sequences of arbitrary length (a function of the sine wave).

It goes without saying that word order is crucial for NER, as these made-up headlines illustrate

Pre-Training

The idea behind pre-training is to initialize the model with a general language-modelling capacity that can later be transferred to a multitude of NLP tasks. While for most researchers and practitioners, pre-training BERT is rather expensive, domain-specific pre-training is known to further boost performance (Xie et al., 2019). For completeness, I will summarize the pre-training step:

Two prediction tasks are used which are entirely self-supervised.

Masked Language Modelling (MLM). 15 % of the tokens in a sentence are randomly selected. The selected tokens are “masked” in a randomized procedure; the other tokens are unchanged. BERT learns to predict the original word from the output hidden states corresponding to the masked words, using a softmax layer.The masking procedure is defined as follows: In 80% of the cases, the token is replaced with a special [MASK] token. In 10% of the cases, the token is replaced with a word chosen randomly from the vocabulary. In the remaining 10%, the token is left unchanged but still included in the loss.

Next Sentence Prediction. The pre-training input consists of two segments, A and B. In 50% of the cases, the segments form a sequence in the original text. In the other cases, they have been randomly paired. From the last hidden state corresponding to the [CLS] token (sometimes called “pooled model output”), BERT learns to predict whether the B is a “next sentence” to A or not.

The originally published English models have been pre-trained on English Wikipedia, and books. From the two tasks, BERT has learnt representations of both words and sentences, which is a good starting point for fine-tuning.

Fine-Tuning for NER

Illustration of BERT for NER (Devlin et al. 2018)

When BERT is fine-tuned on a task, the pre-trained Transformer functions as an encoder, and a randomly initialized classifier is now added on top. In the case of NER, the classifier is simply a projection from the token hidden state size to the size of the tag set, with a subsequent softmax operation to turn the scores into likelihoods. The token classifier is shared across all positions. Considering that no RNN and no CRF layer is used, the NER classifier is simpler than the classifier proposed by Lample et al. (2016), which was used to achieve 93.5 F1 with the BERT “competitor” from Facebook AI Research (Baevski et al., 2019).

To understand the classification step, it is useful to remember how the input has been preprocessed a few sections above. The one-to-many property of the WordPiece tokenizer ensures that for every token in the original dataset, there is at least one individual WordPiece that can be tagged. If an original token is split into multiple WordPieces (one-to-many case), then all WordPieces are tagged by the classifier, but only the head predictions are included in the loss and in the output at runtime. In other words, the head WordPiece serves as a proxy for the full original token. Thus, only the hidden states corresponding to single or head WordPieces are a relevant part of BERT’s last layer, while for the lower levels, all hidden states remain relevant.

During fine-tuning, both the encoder and the classifier are trained with a small learning rate. Fine-tuning is usually done for a small number of epochs, e.g. 4, and with a two-phased optimization procedure (Vaswani et al., 2017):

Warmup. For a percentage of the fine-tuning steps (default: 0.1), the learning rate is increased from 0 (default: linearly).

Decay. For remaining steps, the learning rate is decreased such that it is zero in the end (e.g. linearly).

In order to achieve the results they have published, the BERT authors have selected the best learning rate out of {5e-5, 3e-5, 2e-5} based on the validation performance.

They have reported the following results for the English CoNLL-2003 test set:

Model size	English CoNLL-2003 Test F1
BERT-Base	92.4
BERT-Large	92.8

As an alternative to fine-tuning, Devlin et al. (2019) report results on a feature-based approach, too. In this case, a two-layer BiLSTM is put on top of the pre-trained BERT, and during training, the BERT parameters remain frozen. Comparing the validation set results of this approach to the validation set results of the fine-tuning approach, we can see that the more expensive fine-tuning brings an improvement for NER, however a small one:

	English CoNLL-2003 Dev F1
Fine-tuning approach
BERT-Base	96.4
BERT-Large	96.6
Feature-based approach
WordPiece Embeddings (first layer)	91.0
Second-to-Last Hidden	95.6
Last Hidden	94.9
Weighted Sum Last Four Hidden	95.9
Concat Last Four Hidden	96.1
Weighted Sum All 12 Layers	95.5

Cross-Lingual Transfer

Cross-lingual NER is a scenario where there are enough data for a source language (usually English), and only little data for a target language. For this post I will look at the most extreme case, where there are only evaluation data, and no training data at all, for the target language. This case is often called zero-shot transfer.

The team behind BERT has open-sourced a multilingual BERT model (sometimes called mBERT) that allows for experiments in this direction. Right now, the only documentation available is in a README on GitHub.

mBERT has been trained on 104 Wikipedias in different languages. Generally, the corpora have been treated as if they were from a single language: There is no explicit denotation of the input language. This way, the BERT architecture di not have to be changed and mBERT can now be used with any previously unseen language (e.g. Alemannic, which was not included as it has the 108th largest Wikipedia).

For the vocabulary creation and the pre-training, relatively small languages were upsampled to a degree. I do not know whether the random replacing of words and the random pairing of sentences was done across languages or not – there would be arguments for and against doing so.

A systematic study on the cross-lingual effectiveness of mBERT has been conducted by Wu and Dredze (2019). Their experiment was set up as follows:

Fine-tuning. For NER, they first fine-tune bert-base-multilingual-cased on English CoNLL-2003, choosing a combination of learning rate, batch size and number of epochs that has the best performance on the English validation set

Zero-shot evaluation. Then they evaluate the model on the test set of another language (German, Spanish and Dutch). They want to find out whether mBERT can generalize to other languages without having seen target-language examples for the fine-tuning task.

They report the following results for the CoNLL-2002/2003 languages:

	EN	DE	NL	ES	ZH
Supervised	91.97	82.82	90.94	87.38	93.17
Zero-shot SOTA (Xie et al. 2019)		57.76	71.25	72.37	–
Zero-Shot	–	69.56	77.57	74.96	51.90

Zero-shot transfer is of course not better than supervised learning, but mBERT clearly outperforms the previous zero-shot state of the art by Xie et al. (2018).

Pires et al. (2019) perfomed the same experiment and even provide numbers for all source–target combinations of the CoNLL data:

↓ Fine-tuning \ Eval →	EN	DE	NL	ES
EN	90.70	69.74	77.36	73.59
DE	73.83	82.00	76.25	70.03
NL	65.46	65.68	89.86	72.10
ES	65.38	59.40	64.39	87.18

It is especially interesting that zero-short transfer to Dutch seems to benefit more from a Spanish source than from English or German.

The slight differences between the two tables can be explained by implementation details and different hyperparameters, considering that Pires et al. (2019) did not do any hyperparameter tuning.

To conclude, BERT is a good basis to achieve state-of-the-art results on Named Entity Recognition, and the fine-tuning for this task is relatively simple to implement. The multilingual BERT model even allows for an easy cross-lingual transfer, but the results here still have room for improvement.

References

Alexei Baevski, Sergey Edunov, Yinhan Liu, Luke Zettlemoyer, and Michael Auli. Cloze-driven pretraining of self-attention networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 5363–5372. 2019. ↩

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 4171–4186. 2019. ↩ ¹ ²

Melvin Johnson, Mike Schuster, Quoc V Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, and others. Google’s multilingual neural machine translation system: enabling zero-shot translation. Transactions of the Association for Computational Linguistics, 5:339–351, 2017. ↩

Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. Neural architectures for named entity recognition. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 260–270. 2016. ↩

Telmo Pires, Eva Schlinger, and Dan Garrette. How multilingual is multilingual bert? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 4996–5001. 2019. ↩ ¹ ²

Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 1715–1725. 2016. ↩

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, 5998–6008. 2017. ↩ ¹ ²

Shijie Wu and Mark Dredze. Beto, bentz, becas: the surprising cross-lingual effectiveness of bert. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 833–844. 2019. ↩ ¹ ² ³

Jiateng Xie, Zhilin Yang, Graham Neubig, Noah A Smith, and Jaime G Carbonell. Neural cross-lingual named entity recognition with minimal resources. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 369–379. 2018. ↩

Qizhe Xie, Zihang Dai, Eduard Hovy, Minh-Thang Luong, and Quoc V Le. Unsupervised data augmentation. arXiv preprint arXiv:1904.12848, 2019. ↩

How NLP affects Gender Equality

2019-04-16T00:00:00+02:00

More and more applications of Natural Language Processing (NLP) are used in everyday life: When you translate a paragraph online, you are using an application of NLP, and also when you dictate a letter to a speech recognition system or when you ask questions to a voice assistant. The business world, too, is full of hidden but powerful applications of NLP.

Given this increasing applicability, researchers need to be aware of ethical concepts such as gender equality. First of all, NLP makes use of gender as an explicit variable in many places. For example, state-of-the-art systems can infer the gender of a writer with a certain degree of accuracy, based on stylistic features of the text (see Nguyen et al. (2016) for a critical survey). In an extreme case, this technology could allow employers to circumvent anonymization of job applications.

Secondly, gender is ubiquitous in NLP systems as an latent variable. To give an extreme example, think of a system that evaluates the quality of application letters. Even if the system seems harmless from a user perspective, it still might implicitly consider the applicants’ gender based on stylometric hints. If the company’s employment history has been biased towards men in the past, the system might wrongly infer that female applicants are less qualified in general. Another word for this phenomenon is proxy discrimination, as recently discussed in a NYT op-ed article.

Learning to discriminate

Let us have a closer look on gender as a latent variable. How can such a variable come into play? In NLP, most training data are derived from speech that humans have uttered somewhere in the past. Traditionally, the data are curated and annotated by experts.

But researchers like to experiment with unsupervised approaches, feeding to NLP systems gigantic amounts of raw textual data: books, social media posts, newspapers, or Wikipedia articles. Digesting those billions of utterances, systems pick up information about the morphology, syntax and semantics of a language.

It has been shown that this way, the systems also acquire a concept of gender – and one that is fraught with stereotypes. For example, many systems display the same gender biases that have been measured in humans: They associate words denoting women (“woman”, “girl”) less with math or science, and more with the arts, than male words (Caliskan et al., 2017).

Correlation between association of word embeddings, and actual employment numbers as shown by Caliskan et al. (2017).

One could argue that this kind of latent gender bias is not an issue of NLP itself. On this account, the systems reflect society as it is, with all its faults. For example, the association of certain occupations with the feminine gender could be seen as an inherent property of natural language.

In the same way of thinking, a latent gender bias could be attributed to the annotators of the data, who are known to impress their own biases on the training data. To give an example, annotators asked to caption images of people tend to overspecify the gender: They would label a snowboarder as a man even if the face is not visible, and in consequence, image captioning systems learn to associate snowboards with men (Hendricks et al., 2018).

Finally, the bias could be attributed to the end-users and their preferences. For example, the fact that all major developers of voice assistants have chosen a female voice for their product is usually justified with customer expectations.

Comparison of how different versions of a system caption the image of a snowboarder seen from behind (Hendricks et al., 2018)

But this argument overlooks that the elements of an NLP system have been intentionally and actively composed by people. In addition, bias does not always stem from outside, but can also emerge from the system itself (Friedman and Nissenbaum, 1996).

A military speech recognition system may be developed with mostly male soldiers in mind, which in itself many people would find ethically acceptable. But if the system would later be marketed to a general public without modifications, and would have a worse performance for female voices, a new kind of gender bias had emerged.

This scenario is not far-fetched, as one of the most popular English speech corpora has been commissioned by the U.S. military 30 years ago. In fact, the imbalance of this dataset may partly explain why the YouTube captioning system works better for male voices, or a least did so two years ago (Tatman, 2017).

An approach to avoid accidental bias in NLP systems has been proposed by Bender and Friedman (2018). In their view, a standard called data statements would require the creators of a new dataset to describe the circumstances of creation, and from the users of the dataset to reiterate this statement in a brief paragraph everytime they use it. While Bender and Friedman do not assume that all bias can be removed from systems through a proper declaration of data, they hope that the research could be better contextualized.

Apart from preventive approaches, a lot of technical solutions for de-biasing a system post hoc have been proposed (Bolukbasi et al., 2016). Furthermore, new forms of testing have been proposed to measure the gender-biasedness of NLP systems (Zhao et al., 2017). Those solutions are often tailored to a specific scenario and do not offer a systemic solution (Gonen and Goldberg, 2019). But they show that researchers have become aware of latent bias and that there is a discussion on how to avoid it.

The three roles of gender in NLP

On the other hand, systems that make explicit use of gender are problematized less frequently, even though the concept of gender has been challenged by Gender Studies and related fields for decades. For example, a recent interdisciplinary review of gender classification fails to mention the existence of constructivist or non-binary approaches to gender entirely (Lin et al., 2016).

In a first set of guidelines on the explicit use of gender in NLP, Larson (2017) proposes the following: Researchers should always specify what concept of gender they employ, even if just means quoting a classic definition. In addition, researchers should only utilize information on gender if it is necessary to achieve the research objective, and not just because the data are already available or because it is easier to ask the question “What is your gender?” than it is to ask other questions. Researchers should also specify what method they use to distinguish between genders in the annotation process.

Larson’s guidelines are a balanced ethical foundation for future NLP research into gender. However, there is one aspect that goes unmentioned by Larson: How NLP technology could me misused by malicious people to discriminate on the basis of gender. Would it be prudent to stop NLP research altogether in order to prevent abuse?

When discussing the consequences of technology, ethicists often employ the concept of dual use (Hovy and Spruit, 2016). To give an example, every system that can be used for the inference of gender from language (gender profiling) can also be used to rewrite the text such as to prevent this inference (gender obfuscation; Reddy and Knight (2016)). Another – reverse – example are systems that can detect misogynist tweets but that could also be misused to automatically generate misogynist speech).

Style transfer experiment that could also be used for obfuscation of gender (Lample et al., 2019)

I believe that due its specific nature, NLP has a third use in addition to this dual use: Because they can analyze data on a large scale, NLP systems can inform a critical public of preexistent bias that manifests in natural-language texts. There are many studies where this third, sociolinguistic-diagnostic use is applied, from the analysis of letters of recommendation for male/female job applicants (Schmader et al., 2007) to the analysis of questions that sports journalists pose to male/female tennis players (Fu et al., 2016). In another example, Fast et al. (2016) analyze amateur fiction using NLP and find an abundance of gender stereotypes in every genre, irrespective of the author’s gender. As a final example, Garg et al. (2018) quantify the historical development of stereotypes based on Google Books, and show a correlation with U.S. census data.

This third use of NLP is clearly a chance to promote gender equality, even though some may criticise that all those studies assume a binary view of gender. I think that according to the guidelines by Larson (2017), this simplification can be justified as there is a clear research objective: making discrimination visible.

In my view, NLP offers chances as well a threats to gender equality, and the threats can have various sources: Preexistent societal bias, emergent bias, or the unreflected use of gender as an explicit variable. Given the promises that NLP holds for diagnosing a discriminatory use of language, there is hope that the opportunities will eventually outweigh the threats.

This is an abridged version of an essay written as part of the certificate program “Gender and Diversity Competence” at LMU Munich.

References

Emily M Bender and Batya Friedman. Data statements for natural language processing: toward mitigating system bias and enabling better science. Transactions of the Association for Computational Linguistics, 6:587–604, 2018. ↩

Tolga Bolukbasi, Kai-Wei Chang, James Y Zou, Venkatesh Saligrama, and Adam Tauman Kalai. Quantifying and Reducing Stereotypes in Word Embeddings. CoRR, 2016. URL: http://arxiv.org/abs/1606.06121, arXiv:1606.06121. ↩

Aylin Caliskan, Joanna J Bryson, and Arvind Narayanan. Semantics derived automatically from language corpora contain human-like biases. Science, 356(6334):183–186, 2017. ↩ ¹ ²

Ethan Fast, Tina Vachovsky, and Michael S Bernstein. Shirtless and dangerous: Quantifying linguistic signals of gender bias in an online fiction writing community. In Tenth International AAAI Conference on Web and Social Media. 2016. ↩

Batya Friedman and Helen Nissenbaum. Bias in computer systems. ACM Transactions on Information Systems (TOIS), 14(3):330–347, 1996. ↩

Liye Fu, Cristian Danescu-Niculescu-Mizil, and Lillian Lee. Tie-breaker: Using language models to quantify gender bias in sports journalism. In Proceedings of the IJCAI workshop on NLP meets Journalism. 2016. ↩

Nikhil Garg, Londa Schiebinger, Dan Jurafsky, and James Zou. Word embeddings quantify 100 years of gender and ethnic stereotypes. Proceedings of the National Academy of Sciences, 115(16):E3635–E3644, 2018. ↩

Hila Gonen and Yoav Goldberg. Lipstick on a pig: debiasing methods cover up systematic gender biases in word embeddings but do not remove them. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 609–614. 2019. ↩

Lisa Anne Hendricks, Kaylee Burns, Kate Saenko, Trevor Darrell, and Anna Rohrbach. Women also snowboard: Overcoming bias in captioning models. In European Conference on Computer Vision, 793–811. 2018. ↩ ¹ ²

Dirk Hovy and Shannon L Spruit. The social impact of natural language processing. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), volume 2, 591–598. 2016. ↩

Guillaume Lample, Sandeep Subramanian, Eric Smith, Ludovic Denoyer, Marc'Aurelio Ranzato, and Y-Lan Boureau. Multiple-attribute text rewriting. In International Conference on Learning Representations. 2019. URL: https://openreview.net/forum?id=H1g2NhC5KQ. ↩

Brian Larson. Gender as a Variable in Natural-Language Processing: Ethical Considerations. In Proceedings of the First ACL Workshop on Ethics in Natural Language Processing, 1–11. 2017. ↩ ¹ ²

Feng Lin, Yingxiao Wu, Yan Zhuang, Xi Long, and Wenyao Xu. Human gender classification: a review. International Journal of Biometrics, 8(3-4):275–300, 2016. ↩

Dong Nguyen, A Seza Doğruöz, Carolyn P Rosé, and Franciska de Jong. Computational sociolinguistics: A survey. Computational linguistics, 42(3):537–593, 2016. ↩

Sravana Reddy and Kevin Knight. Obfuscating gender in social media writing. In Proceedings of the First Workshop on NLP and Computational Social Science, 17–26. 2016. ↩

Toni Schmader, Jessica Whitehead, and Vicki H Wysocki. A Linguistic Comparison of Letters of Recommendation for Male and Female Chemistry and Biochemistry Job Applicants. Sex Roles, 57:509–514, 2007. ↩

Rachael Tatman. Gender and Dialect Bias in YouTube's Automatic Captions. In Proceedings of the First ACL Workshop on Ethics in Natural Language Processing, 53–59. 2017. ↩

Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang. Men Also Like Shopping: Reducing Gender Bias Amplification using Corpus-level Constraints. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2979–2989. 2017. ↩