<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"><title>Jannis Vamvas</title><link href="https://vamvas.ch/" rel="alternate"></link><link href="https://vamvas.ch/feeds/all.atom.xml" rel="self"></link><id>https://vamvas.ch/</id><updated>2025-06-23T00:00:00+02:00</updated><entry><title>The Joy of Multiple-Choice</title><link href="https://vamvas.ch/the-joy-of-multiple-choice" rel="alternate"></link><published>2025-06-23T00:00:00+02:00</published><updated>2025-06-23T00:00:00+02:00</updated><author><name>Jannis Vamvas</name></author><id>tag:vamvas.ch,2025-06-23:/the-joy-of-multiple-choice</id><summary type="html">&lt;p&gt;Box-ticking exams save ink and time.&lt;/p&gt;</summary><content type="html">&lt;p&gt;After attending the course &lt;em&gt;"Examinations with Multiple Choice Items"&lt;/em&gt;&lt;sup id="sf-the-joy-of-multiple-choice-1-back"&gt;&lt;a href="#sf-the-joy-of-multiple-choice-1" class="simple-footnote" title="Organized by Antonia Bonaccorso and Tobias Halbherr of the ETH/UZH Didactica Program."&gt;1&lt;/a&gt;&lt;/sup&gt;, I gained valuable insights into writing effective multiple-choice questions for university teaching.
Having just completed another round of exams, I'd now like to share what I've learned.&lt;/p&gt;
&lt;h2&gt;Pros and Cons of Multiple-Choice Exams&lt;/h2&gt;
&lt;p&gt;The advantages of multiple-choice questions are well known. Most importantly, they are a quite efficient form of examination because they are so &lt;strong&gt;structured&lt;/strong&gt; and &lt;strong&gt;unambiguous&lt;/strong&gt;:&lt;/p&gt;
&lt;ul style="list-style-type: '🟩   '; margin-left: 1em; gap: 0.7em; display: flex; flex-direction: column;"&gt;
  &lt;li&gt;
    Students can demonstrate their knowledge without writing lengthy essays. In terms of ink-to-knowledge ratio, multiple-choice questions are &lt;strong&gt;more efficient&lt;/strong&gt; than open-ended questions.
  &lt;/li&gt;
  &lt;li&gt;
    For instructors, grading of multiple-choice items can be &lt;strong&gt;delegated to teaching assistants&lt;/strong&gt; (or even automated if the exam is digital).
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Beyond time efficiency, I've come to value several other benefits of multiple-choice questions:&lt;/p&gt;
&lt;ul style="list-style-type: '🟩   '; margin-left: 1em; gap: 0.7em; display: flex; flex-direction: column;"&gt;
  &lt;li&gt;
    They work in a variety of settings (pen-and-paper exams, live Kahoot quizzes during lectures, or self-paced assessments). This versatility makes it &lt;strong&gt;easier to reuse items&lt;/strong&gt; across different formats.
  &lt;/li&gt;
  &lt;li&gt;
    Multiple-choice questions and answers can be automatically &lt;strong&gt;shuffled&lt;/strong&gt;, making it harder for students to cheat by copying from each other.
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Finally, I believe that &lt;strong&gt;interoperability with AI tools&lt;/strong&gt; is an important consideration:&lt;/p&gt;
&lt;ul style="list-style-type: '🟩   '; margin-left: 1em; gap: 0.7em; display: flex; flex-direction: column;"&gt;
  &lt;li&gt;
    AI tools like ChatGPT or NotebookLM can now effectively generate multiple-choice questions from slides and lecture notes, whereas I am less sure about their ability to create smart open-ended questions.
  &lt;/li&gt;
  &lt;li&gt;
    AI tools are also valuable for quality control: they can verify that questions are clear, that only one answer is correct, and that the solution I give to the teaching assistants matches the correct answer.
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In my experience, AI tools have made the creation of multiple-choice exams quite efficient, although I should note that I don't have rigorous evidence that AI tools work better specifically for multiple-choice questions compared to other formats.&lt;/p&gt;
&lt;p&gt;That said, multiple-choice questions also have disadvantages:&lt;/p&gt;
&lt;ul style="list-style-type: '🟥   '; margin-left: 1em; gap: 0.7em; display: flex; flex-direction: column;"&gt;
  &lt;li&gt;
    It seems obvious that &lt;strong&gt;only some types of learning goals&lt;/strong&gt; can be assessed with multiple-choice questions (typically, remembering facts and terminology, or estimating quantities).
  &lt;/li&gt;
  &lt;li&gt;
    Often, for a given course, there is a &lt;strong&gt;limited number of questions that can be constructed from the course material&lt;/strong&gt;. When creating a new exam in the following year, I found it hard to come up with additional multiple-choice questions that are equally relevant.
  &lt;/li&gt;
  &lt;li&gt;
    If the items are not carefully designed, &lt;strong&gt;guessing&lt;/strong&gt; can be a viable strategy to get a high score. While some teachers try to discourage guessing by penalizing wrong answers, I learned in the course that this is not a good solution, because it might discriminate between students with different risk attitudes (which, as was stressed in the course, are sometimes correlated with gender).
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;How to Write Good Multiple-Choice Questions&lt;/h2&gt;
&lt;p&gt;In the course, we submitted some of our own multiple-choice questions, which were then analyzed by the other participants.&lt;/p&gt;
&lt;p&gt;One insight I gained through this exercise was that people from another discipline are often best positioned to evaluate multiple-choice item design. If they can solve a question through guessing, it likely contains unintentional clues in either the question or answer options.&lt;/p&gt;
&lt;p&gt;As an example, let's analyze the following item that was generated by ChatGPT (&lt;a href="https://chatgpt.com/share/6856a026-2bf8-8012-83de-b6a13daeeebc"&gt;https://chatgpt.com/share/6856a026-2bf8-8012-83de-b6a13daeeebc&lt;/a&gt;):&lt;/p&gt;
&lt;p&gt;&lt;img alt="MCQ-1: What does it mean to quantize an LLM?
A.  To increase the model's accuracy by adding more training data.
B.  To reduce the precision of the model's parameters to use fewer bits, improving efficiency.
C.  To translate the model into multiple languages.
D.  To retrain the model using quantum computing principles." class="left-align" src="https://vamvas.ch/assets/multiple-choice/mcq1.png" width="661px"&gt;&lt;/p&gt;
&lt;p&gt;Some cues that give away the correct answer:&lt;/p&gt;
&lt;ul style="list-style-type: '🟥   '; margin-left: 1em; gap: 0.7em; display: flex; flex-direction: column;"&gt;
  &lt;li&gt;Not all distractors are &lt;strong&gt;equally plausible&lt;/strong&gt;. For example, it's unlikely that the correct answer will have anything to do with quantum computing.&lt;/li&gt;
  &lt;li&gt;The correct answer is often the &lt;strong&gt;longest&lt;/strong&gt; and &lt;strong&gt;most precise&lt;/strong&gt; option, which can unintentionally signal to students which answer to choose.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Some other tips and tricks I learned:&lt;/p&gt;
&lt;ul style="list-style-type: '☑️   '; margin-left: 1em; gap: 0.7em; display: flex; flex-direction: column;"&gt;
  &lt;li&gt;
    Research shows that multiple-choice exams have a &lt;strong&gt;diminishing return on the number of distractors&lt;/strong&gt;, or even a negative return. Three options are typically sufficient, and adding a fourth option increases reading time without making the assessment more reliable. Better use that time for additional questions!
  &lt;/li&gt;
  &lt;li&gt;
    &lt;strong&gt;Instruct students to select the "best" answer rather than the "correct" answer.&lt;/strong&gt; This allows me to keep the correct option concise while making the distractors as plausible as possible. For instance, answer &lt;span style="background-color: #E4E7F4; border-radius: 4px; padding: 0 2px;"&gt;B&lt;/span&gt; in the example above isn't technically accurate (quantization can apply to activations, not just parameters), but it's clearly the "best" option among the choices provided.
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Based on these insights, here's an improved version of the item:&lt;/p&gt;
&lt;p&gt;&lt;img alt="MCQ-2: What does it mean to quantize an LLM?
A.  Pruning unnecessary parameters.
B.  Computing some layers in parallel.
C.  Reducing the precision of the weights." class="left-align" src="https://vamvas.ch/assets/multiple-choice/mcq2.png" width="362px"&gt;&lt;/p&gt;
&lt;p&gt;Interestingly, asking students to select the "best" answer also enables more creative items, such as &lt;strong&gt;estimation questions&lt;/strong&gt; or questions that require &lt;strong&gt;educated guessing&lt;/strong&gt; beyond simple recall. Here's an example:&lt;/p&gt;
&lt;p&gt;&lt;img alt="MCQ-3: Make a guess: How does GPT-4o tokenize the Lithuanian word ‘nebeprisikiškiakopūsteliautum’?
A.  ⟨nebepr, is, ik, iškiak, opūste, liautum⟩
B.  ⟨neb, pre, ski, kayak, opus, tell, autumn⟩
C.  ⟨ne, be, pris, iki, ški, ak, op, ū, stel, ia, utum⟩" class="left-align" src="https://vamvas.ch/assets/multiple-choice/mcq3.png" width="531px"&gt;&lt;/p&gt;
&lt;p&gt;I like this type of question because it requires higher-level thinking beyond memorization.
However, a student in a recent exam-prep session told me that they find the question unfair (because, as they said, they do not speak Lithuanian). This reaction likely stems from the fact that students are conditioned to view multiple-choice questions as tests of memorization only. Therefore, it's important to &lt;strong&gt;prepare students for the question format in advance&lt;/strong&gt;.&lt;/p&gt;
&lt;h2&gt;The K-prime Question Format&lt;/h2&gt;
&lt;p&gt;In the course, I learned about an interesting question format called &lt;strong&gt;"K-prime"&lt;/strong&gt;.&lt;sup id="sf-the-joy-of-multiple-choice-2-back"&gt;&lt;a href="#sf-the-joy-of-multiple-choice-2" class="simple-footnote" title="The name &amp;quot;K-prime&amp;quot; is derived from &amp;quot;K-type questions&amp;quot;, which are common in medical education. K-type (as opposed to A-type) questions allow for multiple correct answers. The K-prime (or K') format improves over simple K-type questions by asking about the truth of every option individually."&gt;2&lt;/a&gt;&lt;/sup&gt;
K-prime questions might be a Swiss invention, as they were first described in the paper &lt;a href="https://doi.org/10.1007/978-94-011-4886-3_46"&gt;"The Swiss Way to Score Multiple True-False Items"&lt;/a&gt; &lt;a class="citation-link" data-toggle="tooltip" data-html="true" title='René Krebs.
&lt;em&gt;The Swiss Way to Score Multiple True-False Items: Theoretical and Empirical Evidence&lt;/em&gt;, pages 158–161.
Springer Netherlands, Dordrecht, 1997.
URL: &lt;a href="https://doi.org/10.1007/978-94-011-4886-3_46"&gt;https://doi.org/10.1007/978-94-011-4886-3_46&lt;/a&gt;, &lt;a href="https://doi.org/10.1007/978-94-011-4886-3_46"&gt;doi:10.1007/978-94-011-4886-3_46&lt;/a&gt;.' href="#Krebs1997" id="ref-Krebs1997-1"&gt;(Krebs, 1997)&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;In K-prime items, &lt;strong&gt;every answer option needs to be separately rated as true or false&lt;/strong&gt; by the students. Any number of options can be true (including zero or all of them). There are always four options, and a student gets 2 points if they correctly rate all four options. To reward partial knowledge, the student gets 1 point if three out of four options are correctly rated. However, they get 0 points if they rate two or fewer options correctly, &lt;strong&gt;which makes guessing ineffective&lt;/strong&gt;.&lt;/p&gt;
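&lt;p&gt;To make the scoring rule concrete, here is how it could be written in code (my own illustration, not an official scoring script):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def score_kprime_item(student_ratings, correct_ratings):
    """Score one K-prime item with four true/false options."""
    assert len(student_ratings) == len(correct_ratings) == 4
    n_correct = sum(s == c for s, c in zip(student_ratings, correct_ratings))
    if n_correct == 4:
        return 2
    if n_correct == 3:
        return 1
    return 0

# Three out of four options rated correctly yields 1 point:
print(score_kprime_item([True, False, True, True], [True, False, False, True]))
&lt;/code&gt;&lt;/pre&gt;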
&lt;p&gt;In medical education, K-prime questions are particularly useful because many phenomena have multiple causes or factors, as shown in this example from &lt;a class="citation-link" data-toggle="tooltip" data-html="true" title='René Krebs.
&lt;em&gt;The Swiss Way to Score Multiple True-False Items: Theoretical and Empirical Evidence&lt;/em&gt;, pages 158–161.
Springer Netherlands, Dordrecht, 1997.
URL: &lt;a href="https://doi.org/10.1007/978-94-011-4886-3_46"&gt;https://doi.org/10.1007/978-94-011-4886-3_46&lt;/a&gt;, &lt;a href="https://doi.org/10.1007/978-94-011-4886-3_46"&gt;doi:10.1007/978-94-011-4886-3_46&lt;/a&gt;.' href="#Krebs1997" id="ref-Krebs1997-2"&gt;Krebs (1997)&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img alt="MCQ-4: You are reading the chest radiograph of a patient suspected to suffer from emphysema.
This diagnosis is supported by:
A.  Increased retrosternal space.
B.  Blunting of the costophrenic angle.
C.  Horizontal ribs.
D.  Accentuated peripheral pulmonary vascularity." class="left-align" src="https://vamvas.ch/assets/multiple-choice/mcq4.png" width="704px"&gt;&lt;/p&gt;
&lt;p&gt;K-prime questions could also work well for computer science topics:&lt;/p&gt;
&lt;p&gt;&lt;img alt="MCQ-5: Quantization can be applied to which component(s) of an LLM?
A.  Weights
B.  Activations
C.  Vocabulary
D.  Quantization constants" class="left-align" src="https://vamvas.ch/assets/multiple-choice/mcq5.png" width="672px"&gt;&lt;/p&gt;
&lt;p&gt;After implementing K-prime questions myself (in a math refresher course), I found that they offer several advantages over traditional multiple-choice questions:&lt;/p&gt;
&lt;ul style="list-style-type: '🟩   '; margin-left: 1em; gap: 0.7em; display: flex; flex-direction: column;"&gt;
  &lt;li&gt;
    Question writing becomes easier, because the &lt;strong&gt;options are independent of each other&lt;/strong&gt;. Any interdependence that remains stems from the content itself rather than from the format of the question (e.g., students using a process of elimination to exploit the format).
  &lt;/li&gt;
  &lt;li&gt;
    It will likely become &lt;strong&gt;easier to reuse questions&lt;/strong&gt; in future exams, since answer options can be varied independently of each other. Also, for certain topics, K-prime questions appear to be more straightforward to create than traditional multiple-choice questions, allowing for a larger question pool overall.
  &lt;/li&gt;
    &lt;li&gt;
        The grading scheme &lt;strong&gt;discourages guessing&lt;/strong&gt;, making sure that students' individual risk attitudes don't affect their grade.
    &lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Tips and Tricks for Writing Good K-prime Questions&lt;/h2&gt;
&lt;p&gt;Based on my experience writing K-prime questions and supervising teaching assistants who created their own, here are some key tips for writing effective K-prime questions:&lt;/p&gt;
&lt;ul style="list-style-type: '☑️   '; margin-left: 1em; gap: 0.7em; display: flex; flex-direction: column;"&gt;
  &lt;li&gt;&lt;strong&gt;Avoid negations&lt;/strong&gt; (or double negations). Since any statement may be true or false, there is no need to use negations to make a statement fit the item format.
  &lt;/li&gt;&lt;li&gt;&lt;strong&gt;Avoid compositions of several propositions&lt;/strong&gt; (e.g., "A and B", or "A, because B"). Simply split them into separate statements.&lt;/li&gt;
  &lt;li&gt;Make sure to balance true and false options: I noticed that &lt;strong&gt;AI tools like ChatGPT default to writing correct options, and incorrect options need to be written manually&lt;/strong&gt;. One explanation might be that unlike standard multiple-choice questions, LLMs haven't seen many K-prime questions in their training data. Since K-prime questions lack the formal constraint that a single option must be correct, LLMs are more influenced by their "truth bias" than anything else. Interestingly, a study has found that human exam creators have a similar bias, choosing to make 60.5% of the options true on average &lt;a class="citation-link" data-toggle="tooltip" data-html="true" title='René Krebs.
&lt;em&gt;&lt;span class="bibtex-protected"&gt;Prüfen mit Multiple Choice: Kompetent planen, entwickeln, durchführen und auswerten&lt;/span&gt;&lt;/em&gt;.
Hogrefe Verlag GmbH &amp;amp; Co. KG, 2019.
ISBN 9783456759029.
URL: &lt;a href="https://elibrary.hogrefe.com/book/10.1024/85902-000"&gt;https://elibrary.hogrefe.com/book/10.1024/85902-000&lt;/a&gt;, &lt;a href="https://doi.org/10.1024/85902-000"&gt;doi:10.1024/85902-000&lt;/a&gt;.' href="#krebs2019" id="ref-krebs2019-1"&gt;(Krebs, 2019)&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;I have become a believer in multiple-choice questions, but it's clear that they need to be used in combination with other question types.
Due to their efficiency and ease of grading, I am planning to keep them in the mix, so that I can free up resources for more creative and open-ended questions in the rest of the exam.&lt;/p&gt;
&lt;p&gt; &lt;/p&gt;

&lt;p&gt;Answer key for the examples:&lt;/p&gt;
&lt;p style="text-align: right; transform: rotate(180deg);"&gt;
MCQ-1: B, MCQ-2: C, MCQ-3: C, MCQ-4: A, B, C, MCQ-5: A, B, D
&lt;/p&gt;

&lt;p&gt; &lt;/p&gt;

&lt;p&gt;&lt;em&gt;This post was written for the course "Teaching Skills – Systematic Development of Teaching Competence" at the University of Zurich.&lt;/em&gt;&lt;/p&gt;&lt;div id="bibliography" class="appendix"&gt;&lt;h2&gt;References&lt;/h2&gt;&lt;p id="Krebs1997"&gt;René Krebs.
&lt;em&gt;The Swiss Way to Score Multiple True-False Items: Theoretical and Empirical Evidence&lt;/em&gt;, pages 158–161.
Springer Netherlands, Dordrecht, 1997.
URL: &lt;a href="https://doi.org/10.1007/978-94-011-4886-3_46"&gt;https://doi.org/10.1007/978-94-011-4886-3_46&lt;/a&gt;, &lt;a href="https://doi.org/10.1007/978-94-011-4886-3_46"&gt;doi:10.1007/978-94-011-4886-3_46&lt;/a&gt;. &lt;a class="cite-backref" href="#ref-Krebs1997-1" title="Jump back to reference 1"&gt;↩&lt;/a&gt;&lt;a class="cite-backref" href="#ref-Krebs1997-1" title="Jump back to reference 1"&gt; &lt;sup&gt;1&lt;/sup&gt; &lt;/a&gt;&lt;a class="cite-backref" href="#ref-Krebs1997-2" title="Jump back to reference 2"&gt;&lt;sup&gt;2&lt;/sup&gt; &lt;/a&gt;&lt;/p&gt;
&lt;p id="krebs2019"&gt;René Krebs.
&lt;em&gt;&lt;span class="bibtex-protected"&gt;Prüfen mit Multiple Choice: Kompetent planen, entwickeln, durchführen und auswerten&lt;/span&gt;&lt;/em&gt;.
Hogrefe Verlag GmbH &amp;amp; Co. KG, 2019.
ISBN 9783456759029.
URL: &lt;a href="https://elibrary.hogrefe.com/book/10.1024/85902-000"&gt;https://elibrary.hogrefe.com/book/10.1024/85902-000&lt;/a&gt;, &lt;a href="https://doi.org/10.1024/85902-000"&gt;doi:10.1024/85902-000&lt;/a&gt;. &lt;a class="cite-backref" href="#ref-krebs2019-1" title="Jump back to reference 1"&gt;↩&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;&lt;ol class="simple-footnotes"&gt;&lt;li id="sf-the-joy-of-multiple-choice-1"&gt;Organized by Antonia Bonaccorso and Tobias Halbherr of the ETH/UZH Didactica Program. &lt;a href="#sf-the-joy-of-multiple-choice-1-back" class="simple-footnote-back"&gt;↩︎&lt;/a&gt;&lt;/li&gt;&lt;li id="sf-the-joy-of-multiple-choice-2"&gt;The name "K-prime" is derived from "K-type questions", which are common in medical education. K-type (as opposed to A-type) questions allow for multiple correct answers. The K-prime (or K') format improves over simple K-type questions by asking about the truth of every option individually. &lt;a href="#sf-the-joy-of-multiple-choice-2-back" class="simple-footnote-back"&gt;↩︎&lt;/a&gt;&lt;/li&gt;&lt;/ol&gt;</content><category term="blog"></category></entry><entry><title>OpenAI's Speculative Decoding, Reverse-Engineered</title><link href="https://vamvas.ch/openai-predicted-outputs" rel="alternate"></link><published>2025-04-21T00:00:00+02:00</published><updated>2025-04-21T00:00:00+02:00</updated><author><name>Jannis Vamvas</name></author><id>tag:vamvas.ch,2025-04-21:/openai-predicted-outputs</id><summary type="html">&lt;p&gt;Why LLMs are faster if we give them a draft to complete.&lt;/p&gt;</summary><content type="html">&lt;p&gt;Recently, OpenAI introduced a &lt;a href="https://platform.openai.com/docs/guides/predicted-outputs"&gt;&lt;em&gt;Predicted Outputs&lt;/em&gt;&lt;/a&gt; feature for its GPT-4o language model. This feature promises to speed up text generation when the output will largely match a known pattern (e.g., if the language model needs to repeat an input text with minor changes or additions):&lt;/p&gt;
&lt;div style="display: flex; justify-content: center;"&gt;
&lt;blockquote class="twitter-tweet" data-dnt="true"&gt;&lt;p lang="en" dir="ltr"&gt;Introducing Predicted Outputs—dramatically decrease latency for gpt-4o and gpt-4o-mini by providing a reference string. &lt;a href="https://t.co/n6mqjQwQV1"&gt;https://t.co/n6mqjQwQV1&lt;/a&gt;&lt;br&gt;&lt;br&gt;Speed up:&lt;br&gt;- Updating a blog post in a doc&lt;br&gt;- Iterating on prior responses&lt;br&gt;- Rewriting code in an existing file, like &lt;a href="https://twitter.com/exponent_run?ref_src=twsrc%5Etfw"&gt;&amp;#64;exponent_run&lt;/a&gt; here: &lt;a href="https://t.co/c9O3YtHH7N"&gt;pic.twitter.com/c9O3YtHH7N&lt;/a&gt;&lt;/p&gt;&amp;mdash; OpenAI Developers &amp;#40;&amp;#64;OpenAIDevs&amp;#41; &lt;a href="https://twitter.com/OpenAIDevs/status/1853564730872607229?ref_src=twsrc%5Etfw"&gt;November 4, 2024&lt;/a&gt;&lt;/blockquote&gt; &lt;script async src="https://platform.twitter.com/widgets.js" charset="utf-8"&gt;&lt;/script&gt;
&lt;/div&gt;

&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;What sparked my curiosity was the fact that OpenAI's &lt;a href="https://platform.openai.com/docs/guides/predicted-outputs"&gt;documentation&lt;/a&gt; for this feature explains very little – too little, for my taste. They provide an example output, but the example is clearly wrong, and you can't reproduce it via the API.&lt;/p&gt;
&lt;p&gt;So I set out to reverse-engineer the algorithm behind this feature, both to understand it myself and to be able to better explain it to my students.&lt;/p&gt;
&lt;p&gt;The idea of providing a language model with a draft of the output is a well-known concept in NLP, and looks something like this:&lt;/p&gt;
&lt;p&gt;&lt;img alt="predicted-output-illustration.png" src="../assets/openai-predicted-outputs/predicted-output-illustration.png" width="450px"&gt;&lt;/p&gt;
&lt;p&gt;If the Predicted Outputs feature works this way, then it would be similar to &lt;em&gt;Aggressive Decoding&lt;/em&gt; &lt;a class='citation-link' data-toggle='tooltip' data-html='true' title='Xin Sun, Tao Ge, Furu Wei, and Houfeng Wang.
Instantaneous grammatical error correction with shallow aggressive decoding.
In Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli, editors, &lt;em&gt;Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)&lt;/em&gt;, 5937–5947. Online, August 2021. Association for Computational Linguistics.
URL: &lt;a href="https://aclanthology.org/2021.acl-long.462/"&gt;https://aclanthology.org/2021.acl-long.462/&lt;/a&gt;, &lt;a href="https://doi.org/10.18653/v1/2021.acl-long.462"&gt;doi:10.18653/v1/2021.acl-long.462&lt;/a&gt;.' href='#sun-etal-2021-instantaneous' id='ref-sun-etal-2021-instantaneous-1'&gt;(Sun et al., 2021)&lt;/a&gt;, or &lt;a href="https://github.com/apoorvumang/prompt-lookup-decoding"&gt;&lt;em&gt;Prompt Lookup Decoding&lt;/em&gt;&lt;/a&gt;, which is implemented in &lt;a href="https://huggingface.co/docs/transformers/v4.51.1/en/llm_optims#prompt-lookup-decoding"&gt;HF Transformers&lt;/a&gt; and &lt;a href="https://docs.vllm.ai/en/v0.8.2/features/spec_decode.html#speculating-by-matching-n-grams-in-the-prompt"&gt;vLLM&lt;/a&gt;. There are also clear parallels to &lt;em&gt;Speculative Decoding&lt;/em&gt; (&lt;a class='citation-link' data-toggle='tooltip' data-html='true' title='Yaniv Leviathan, Matan Kalman, and Yossi Matias.
Fast inference from transformers via speculative decoding.
In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, &lt;em&gt;Proceedings of the 40th International Conference on Machine Learning&lt;/em&gt;, volume 202 of Proceedings of Machine Learning Research, 19274–19286. PMLR, 23–29 Jul 2023.
URL: &lt;a href="https://proceedings.mlr.press/v202/leviathan23a.html"&gt;https://proceedings.mlr.press/v202/leviathan23a.html&lt;/a&gt;.' href='#pmlr-v202-leviathan23a' id='ref-pmlr-v202-leviathan23a-1'&gt;Leviathan et al. (2023)&lt;/a&gt;, &lt;a class='citation-link' data-toggle='tooltip' data-html='true' title='Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, and John Jumper.
Accelerating large language model decoding with speculative sampling.
2023.
URL: &lt;a href="https://arxiv.org/abs/2302.01318"&gt;https://arxiv.org/abs/2302.01318&lt;/a&gt;, &lt;a href="https://arxiv.org/abs/2302.01318"&gt;arXiv:2302.01318&lt;/a&gt;.' href='#chen2023acceleratinglargelanguagemodel' id='ref-chen2023acceleratinglargelanguagemodel-1'&gt;Chen et al. (2023)&lt;/a&gt;), which uses a draft language model to predict the draft dynamically.&lt;/p&gt;
&lt;p&gt;What makes OpenAI's algorithm different is that the draft is provided separately from the prompt, as an additional API parameter. Only one draft can be provided. Besides the generated response, the API response includes statistics on how many tokens in the draft were accepted and how many were rejected:&lt;/p&gt;
&lt;p&gt;&lt;img alt="api-response.png" src="../assets/openai-predicted-outputs/api-response.png" width="420px"&gt;&lt;/p&gt;
&lt;p&gt;This doesn't make the response cheaper – in fact, rejected tokens are billed in addition to any accepted or additionally generated tokens – but it can speed up the response, since it is faster to verify a draft than to generate the tokens from scratch.&lt;/p&gt;
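&lt;p&gt;For reference, here is a minimal sketch of such a request with the OpenAI Python SDK, based on the public documentation at the time of writing (parameter and field names may change):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;from openai import OpenAI

client = OpenAI()

# The draft ("prediction") is passed separately from the prompt.
draft = "a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z"

completion = client.chat.completions.create(
    model="gpt-4o-mini",
    temperature=0,
    messages=[{"role": "user", "content": "List the lowercase letters of the alphabet, separated by commas. Output only the letters."}],
    prediction={"type": "content", "content": draft},
)

# The usage statistics report how much of the draft was accepted.
details = completion.usage.completion_tokens_details
print(details.accepted_prediction_tokens, details.rejected_prediction_tokens)
&lt;/code&gt;&lt;/pre&gt;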
&lt;h2&gt;Key Concepts&lt;/h2&gt;
&lt;p&gt;To understand how the Predicted Outputs feature likely works, let's introduce some key concepts:&lt;/p&gt;
&lt;h3&gt;Lossless Acceleration&lt;/h3&gt;
&lt;p&gt;Algorithms like Speculative Decoding are &lt;em&gt;lossless&lt;/em&gt; in the sense that they do not alter the generated response – the output remains exactly the same as it would be without the draft. The sole purpose of the draft is to accelerate text generation. If the LLM disagrees with any portion of the draft, the algorithm simply ignores that part. Similarly, the draft won't nudge the LLM toward a specific output. For instance, a draft cannot be used to make the LLM follow a particular output format.&lt;/p&gt;
&lt;h3&gt;Verification&lt;/h3&gt;
&lt;p&gt;A draft accelerates text generation because its tokens can be verified in parallel, whereas text generation from scratch cannot be parallelized and works in a sequential manner, token by token:&lt;/p&gt;
&lt;p&gt;&lt;img alt="draft-verification-vs-autoregressive.png" src="../assets/openai-predicted-outputs/draft-verification-vs-autoregressive.png" width="550px"&gt;&lt;/p&gt;
&lt;p&gt;For simplicity, let's consider the case of &lt;em&gt;greedy decoding&lt;/em&gt;, where the most likely token is selected at each step (as opposed to &lt;em&gt;sampling&lt;/em&gt;, where tokens are selected with some randomness). In this context, verification means checking whether each draft token is indeed the most likely token according to the model, given the previous tokens. In the example above, this is true for the first three tokens in the draft, 'a', 'b' and 'c', but false for the fourth token, where the language model predicts 'd' as the most likely token.&lt;/p&gt;
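&lt;p&gt;To make this concrete, here is a minimal sketch of greedy draft verification with a small open-source model via Hugging Face Transformers. This is only my own illustration of the general idea, not OpenAI's implementation, and the model name is just an example:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def verify_draft_greedy(model, prefix_ids, draft_ids):
    """Return how many leading draft tokens match the model's greedy predictions."""
    # A single forward pass over prefix + draft scores all draft positions in parallel.
    input_ids = torch.tensor([list(prefix_ids) + list(draft_ids)])
    with torch.no_grad():
        logits = model(input_ids).logits[0]
    accepted = 0
    for i, token in enumerate(draft_ids):
        # The logits at position len(prefix) + i - 1 predict the token at position len(prefix) + i.
        predicted = logits[len(prefix_ids) + i - 1].argmax().item()
        if predicted != token:
            break  # first mismatch: the rest of the draft is rejected
        accepted += 1
    return accepted

prefix_ids = tokenizer.encode("a, b, c,")
draft_ids = tokenizer.encode(" d, e, f")
print(verify_draft_greedy(model, prefix_ids, draft_ids))
&lt;/code&gt;&lt;/pre&gt;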
&lt;h3&gt;Lookahead Parameter&lt;/h3&gt;
&lt;p&gt;In practice, verification is limited to a fixed number of draft tokens at a time. Since the draft is likely to be rejected at some point, any computation after the rejection point would be wasted. Defining a lookahead parameter helps reduce the expected number of rejected tokens.&lt;/p&gt;
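&lt;p&gt;Building on the sketch above, a lookahead parameter &lt;em&gt;K&lt;/em&gt; could be implemented roughly as follows (again only an illustration; it reuses the verify_draft_greedy function from the previous sketch):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def verify_with_lookahead(model, prefix_ids, draft_ids, k=16):
    # Verify the draft in windows of at most k tokens and stop at the first
    # mismatch, so that no draft tokens beyond the rejection point are scored.
    context = list(prefix_ids)
    accepted_total = 0
    for start in range(0, len(draft_ids), k):
        window = list(draft_ids[start:start + k])
        accepted = verify_draft_greedy(model, context, window)
        accepted_total += accepted
        context += window[:accepted]
        if accepted != len(window):
            break  # fall back to token-by-token generation from here on
    return accepted_total
&lt;/code&gt;&lt;/pre&gt;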
&lt;h2&gt;OpenAI's Predicted Outputs – Reverse Engineering Report&lt;/h2&gt;
&lt;p&gt;To examine the Predicted Outputs feature, I relied on simple text generation tasks that involve enumerating the Latin alphabet. This has the advantage that a model like gpt-4o-mini is definitely able to solve the task, and so will always accept a correct draft and reject an incorrect draft. Furthermore, the gpt-4o/gpt-4o-mini tokenizer has a separate subword token for each letter in the alphabet, which makes it easy to &lt;a href="https://platform.openai.com/tokenizer"&gt;calculate the expected number of tokens&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img alt="gpt-4o-tokenizer-abc.png" src="../assets/openai-predicted-outputs/gpt-4o-tokenizer-abc.png"&gt;&lt;/p&gt;
&lt;p&gt;For basic tests, I simply asked GPT to list the letters in the alphabet:&lt;/p&gt;
&lt;p&gt;&lt;img alt="gpt-4o-mini-alphabet.png" class="img-shadow" src="../assets/openai-predicted-outputs/gpt-4o-mini-alphabet.png" width="500px"&gt;&lt;/p&gt;
&lt;p&gt;When testing with longer sequences, I used a task where two letters need to be systematically paired:&lt;/p&gt;
&lt;p&gt;&lt;img alt="gpt-4o-mini-cartesian-alphabet.png" class="img-shadow" src="../assets/openai-predicted-outputs/gpt-4o-mini-cartesian-alphabet.png" width="500px"&gt;&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;h3&gt;Observation 1: Predicted Outputs is somewhat stochastic.&lt;/h3&gt;
&lt;p&gt;When repeating the same API call 25 times, four different combinations of accepted and rejected token counts are recorded:&lt;/p&gt;
&lt;div style="display: flex; justify-content: center;"&gt;
&lt;iframe width="464" height="219" seamless frameborder="0" scrolling="no" src="https://docs.google.com/spreadsheets/d/e/2PACX-1vToM0VITyG2jcl3ZnWuoKIc6MFfFHkL15RRohcesP6WV9jZbDzo63HNlGVddZixLiwgrW7kNndbz8ov/pubchart?oid=107196100&amp;amp;format=interactive"&gt;&lt;/iframe&gt;
&lt;/div&gt;

&lt;p&gt;This is with temperature 0. As expected, the model returns the same completion every time, but despite that, the number of accepted and rejected tokens varies.&lt;/p&gt;
&lt;p&gt;One possible explanation is that OpenAI might batch my verification steps with verification steps from other users' requests, and the lookahead parameter could depend on how many tokens are available in the batch.
However, it remains a puzzling observation, without a definitive explanation.&lt;/p&gt;
&lt;p&gt;&lt;span style="display: inline-flex; align-items: center;"&gt;
  Code to reproduce this:&amp;nbsp;&amp;nbsp;
  &lt;a target="_blank"
     href="https://colab.research.google.com/drive/1p6Yudr1vXybk-7fqoZPPdGFCmeGd81f7#scrollTo=bw0Sd0ox-4rl"&gt;
    &lt;img src="https://colab.research.google.com/assets/colab-badge.svg"
         alt="Open in Colab"
         width="150"
         style="vertical-align: middle; margin: 0" /&gt;
  &lt;/a&gt;
&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;h3&gt;Observation 2: The lookahead parameter is &lt;em&gt;K&lt;/em&gt;=16.&lt;/h3&gt;
&lt;p&gt;Next, let's analyze the size of the lookahead window. For this, we can simply provide a draft that is correct up to a point, and then continues with incorrect tokens. By varying the position where the draft begins to be incorrect ("bifurcation position"), we can track the behavior of the API:&lt;/p&gt;
&lt;div style="display: flex; justify-content: center;"&gt;
&lt;iframe width="511" height="317" seamless frameborder="0" scrolling="no" src="https://docs.google.com/spreadsheets/d/e/2PACX-1vToM0VITyG2jcl3ZnWuoKIc6MFfFHkL15RRohcesP6WV9jZbDzo63HNlGVddZixLiwgrW7kNndbz8ov/pubchart?oid=719492994&amp;amp;format=interactive"&gt;&lt;/iframe&gt;
&lt;/div&gt;

&lt;p&gt;While there are some minor variations in the graph, the trend is clear: The number of rejected tokens in this scenario never exceeds 16.
This strongly indicates that OpenAI's lookahead parameter &lt;em&gt;K&lt;/em&gt; is equal to 16: The algorithm verifies up to 16 tokens at a time, and therefore only rejects up to 16 tokens. After the first rejection, the algorithm likely generates the remainder of the sequence token by token.&lt;/p&gt;
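&lt;p&gt;To illustrate the setup, here is a small sketch of how a draft with a controllable bifurcation position can be constructed (the full experiment is in the notebook linked below):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import string

# The expected output: all letter pairs "aa, ab, ..., zz".
correct_pairs = [a + b for a in string.ascii_lowercase for b in string.ascii_lowercase]

def draft_with_bifurcation(position):
    # Correct pairs up to `position`, then an obviously wrong continuation.
    good = correct_pairs[:position]
    bad = ["XX"] * (len(correct_pairs) - position)
    return ", ".join(good + bad)

print(draft_with_bifurcation(3))  # "aa, ab, ac, XX, XX, ..."
&lt;/code&gt;&lt;/pre&gt;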
&lt;p&gt;&lt;span style="display: inline-flex; align-items: center;"&gt;
  Code to reproduce this:&amp;nbsp;&amp;nbsp;
  &lt;a target="_blank"
     href="https://colab.research.google.com/drive/1p6Yudr1vXybk-7fqoZPPdGFCmeGd81f7#scrollTo=cLeDlZemGC0-"&gt;
    &lt;img src="https://colab.research.google.com/assets/colab-badge.svg"
         alt="Open in Colab"
         width="150"
         style="vertical-align: middle; margin: 0" /&gt;
  &lt;/a&gt;
&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;h3&gt;Observation 3: Inserted, deleted, and replaced tokens are all treated the same way.&lt;/h3&gt;
&lt;p&gt;Next, let's verify that the algorithm handles all types of discrepancies between the draft and the expected output in the same manner. As one might expect, whether tokens are added, missing, or replaced in the draft, all these cases result in rejection of the affected tokens.&lt;/p&gt;
&lt;div style="display: flex; flex-direction: column; gap: 20px; margin: 20px 0;"&gt;
  &lt;!-- Correct draft --&gt;
  &lt;div&gt;
    &lt;h4&gt;Correct draft&lt;/h4&gt;
    &lt;div style="display: flex; gap: 5px; margin-bottom: 10px;"&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;b&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;c&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;d&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;e&lt;/span&gt;
    &lt;/div&gt;
    &lt;p&gt;Accepted 5 – rejected 0.&lt;/p&gt;
  &lt;/div&gt;

  &lt;!-- Draft with added token --&gt;
  &lt;div&gt;
    &lt;h4&gt;Draft with added token: 'X' inserted between 'c' and 'd'&lt;/h4&gt;
    &lt;div style="display: flex; gap: 5px; margin-bottom: 10px;"&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;b&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;c&lt;/span&gt;
      &lt;span style="background-color: #FBC6D4; padding: 5px 10px; border-radius: 4px;"&gt;X&lt;/span&gt;
      &lt;span style="background-color: #FBC6D4; padding: 5px 10px; border-radius: 4px;"&gt;d&lt;/span&gt;
    &lt;/div&gt;
    &lt;p&gt;Accepted 3 – rejected 2.&lt;/p&gt;
  &lt;/div&gt;

  &lt;!-- Draft with deleted token --&gt;
  &lt;div&gt;
    &lt;h4&gt;Draft with missing token: 'd' deleted&lt;/h4&gt;
    &lt;div style="display: flex; gap: 5px; margin-bottom: 10px;"&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;b&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;c&lt;/span&gt;
      &lt;span style="background-color: #FBC6D4; padding: 5px 10px; border-radius: 4px;"&gt;e&lt;/span&gt;
      &lt;span style="background-color: #FBC6D4; padding: 5px 10px; border-radius: 4px;"&gt;f&lt;/span&gt;
    &lt;/div&gt;
    &lt;p&gt;Accepted 3 – rejected 2.&lt;/p&gt;
  &lt;/div&gt;

  &lt;!-- Draft with replaced token --&gt;
  &lt;div&gt;
    &lt;h4&gt;Draft with wrong token: 'd' replaced with 'X'&lt;/h4&gt;
    &lt;div style="display: flex; gap: 5px; margin-bottom: 10px;"&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;b&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;c&lt;/span&gt;
      &lt;span style="background-color: #FBC6D4; padding: 5px 10px; border-radius: 4px;"&gt;X&lt;/span&gt;
      &lt;span style="background-color: #FBC6D4; padding: 5px 10px; border-radius: 4px;"&gt;e&lt;/span&gt;
    &lt;/div&gt;
    &lt;p&gt;Accepted 3 – rejected 2.&lt;/p&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;&lt;span style="display: inline-flex; align-items: center;"&gt;
  Code to reproduce this:&amp;nbsp;&amp;nbsp;
  &lt;a target="_blank"
     href="https://colab.research.google.com/drive/1p6Yudr1vXybk-7fqoZPPdGFCmeGd81f7#scrollTo=SKtK3JOqFnHg"&gt;
    &lt;img src="https://colab.research.google.com/assets/colab-badge.svg"
         alt="Open in Colab"
         width="150"
         style="vertical-align: middle; margin: 0" /&gt;
  &lt;/a&gt;
&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;h3&gt;Observation 4: The algorithm can recover after a rejection.&lt;/h3&gt;
&lt;p&gt;While an error in the draft will lead to rejected tokens, more tokens can still be accepted if the remainder of the draft is correct.&lt;/p&gt;
&lt;p&gt;However, recovery doesn't happen immediately after the bifurcation position. Instead, there appears to be a specific number of tokens in the draft that need to be confirmed before verification resumes:&lt;/p&gt;
&lt;!-- Correct draft --&gt;
&lt;div&gt;
    &lt;h4&gt;Correct draft&lt;/h4&gt;
    &lt;div style="display: flex; gap: 5px; margin-bottom: 10px; flex-wrap: wrap;"&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;b&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;c&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;d&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;e&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;f&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;g&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;h&lt;/span&gt;
      &lt;span style="background-color: whitesmoke; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;i&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;j&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;k&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;l&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;m&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;n&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;o&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;p&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: whitesmoke; padding: 5px 10px; border-radius: 4px;"&gt;q&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;r&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;s&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;t&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;u&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;v&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;w&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;x&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;y&lt;/span&gt;
      &lt;span style="background-color: whitesmoke; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;z&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;b&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;b&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;b&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;b&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;c&lt;/span&gt;
    &lt;/div&gt;
    &lt;p&gt;Accepted 55 – rejected 0.&lt;/p&gt;
  &lt;/div&gt;

&lt;!-- Draft with insertion error --&gt;
&lt;div&gt;
    &lt;h4&gt;Draft with added tokens&lt;/h4&gt;
    &lt;div style="display: flex; gap: 5px; margin-bottom: 10px; flex-wrap: wrap;"&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;b&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: #FBC6D4; padding: 5px 10px; border-radius: 4px;"&gt;X&lt;/span&gt;
      &lt;span style="background-color: #FBC6D4; padding: 5px 10px; border-radius: 4px;"&gt;X&lt;/span&gt;
      &lt;span style="background-color: #FBC6D4; padding: 5px 10px; border-radius: 4px;"&gt;X&lt;/span&gt;
      &lt;span style="background-color: #FBC6D4; padding: 5px 10px; border-radius: 4px;"&gt;X&lt;/span&gt;
      &lt;span style="background-color: #FBC6D4; padding: 5px 10px; border-radius: 4px;"&gt;c&lt;/span&gt;
      &lt;span style="background-color: #FBC6D4; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: #FBC6D4; padding: 5px 10px; border-radius: 4px;"&gt;d&lt;/span&gt;
      &lt;span style="background-color: #FBC6D4; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: #FBC6D4; padding: 5px 10px; border-radius: 4px;"&gt;e&lt;/span&gt;
      &lt;span style="background-color: #FBC6D4; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: #FBC6D4; padding: 5px 10px; border-radius: 4px;"&gt;f&lt;/span&gt;
      &lt;span style="background-color: whitesmoke; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: whitesmoke; padding: 5px 10px; border-radius: 4px;"&gt;g&lt;/span&gt;
      &lt;span style="background-color: whitesmoke; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: whitesmoke; padding: 5px 10px; border-radius: 4px;"&gt;h&lt;/span&gt;
      &lt;span style="background-color: whitesmoke; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: whitesmoke; padding: 5px 10px; border-radius: 4px;"&gt;i&lt;/span&gt;
      &lt;span style="background-color: whitesmoke; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: whitesmoke; padding: 5px 10px; border-radius: 4px;"&gt;j&lt;/span&gt;
      &lt;span style="background-color: whitesmoke; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: whitesmoke; padding: 5px 10px; border-radius: 4px;"&gt;k&lt;/span&gt;
      &lt;span style="background-color: whitesmoke; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: whitesmoke; padding: 5px 10px; border-radius: 4px;"&gt;l&lt;/span&gt;
      &lt;span style="background-color: whitesmoke; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: whitesmoke; padding: 5px 10px; border-radius: 4px;"&gt;m&lt;/span&gt;
      &lt;span style="background-color: whitesmoke; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: whitesmoke; padding: 5px 10px; border-radius: 4px;"&gt;n&lt;/span&gt;
      &lt;span style="background-color: whitesmoke; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: whitesmoke; padding: 5px 10px; border-radius: 4px;"&gt;o&lt;/span&gt;
      &lt;span style="background-color: whitesmoke; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: whitesmoke; padding: 5px 10px; border-radius: 4px;"&gt;p&lt;/span&gt;
      &lt;span style="background-color: whitesmoke; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: whitesmoke; padding: 5px 10px; border-radius: 4px;"&gt;q&lt;/span&gt;
      &lt;span style="background-color: whitesmoke; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: whitesmoke; padding: 5px 10px; border-radius: 4px;"&gt;r&lt;/span&gt;
      &lt;span style="background-color: whitesmoke; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;s&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;t&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;u&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;v&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;w&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;x&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;y&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;z&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;b&lt;/span&gt;
      &lt;span style="background-color: whitesmoke; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
    &lt;/div&gt;
    &lt;p&gt;Accepted 21 – rejected 11.&lt;/p&gt;
  &lt;/div&gt;

&lt;!-- Draft with deletion error --&gt;
&lt;div&gt;
    &lt;h4&gt;Draft with missing tokens&lt;/h4&gt;
    &lt;div style="display: flex; gap: 5px; margin-bottom: 10px; flex-wrap: wrap;"&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;b&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: #FBC6D4; padding: 5px 10px; border-radius: 4px;"&gt;e&lt;/span&gt;
      &lt;span style="background-color: #FBC6D4; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: #FBC6D4; padding: 5px 10px; border-radius: 4px;"&gt;f&lt;/span&gt;
      &lt;span style="background-color: #FBC6D4; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: #FBC6D4; padding: 5px 10px; border-radius: 4px;"&gt;g&lt;/span&gt;
      &lt;span style="background-color: #FBC6D4; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: #FBC6D4; padding: 5px 10px; border-radius: 4px;"&gt;h&lt;/span&gt;
      &lt;span style="background-color: #FBC6D4; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: #FBC6D4; padding: 5px 10px; border-radius: 4px;"&gt;i&lt;/span&gt;
      &lt;span style="background-color: #FBC6D4; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: #FBC6D4; padding: 5px 10px; border-radius: 4px;"&gt;j&lt;/span&gt;
      &lt;span style="background-color: whitesmoke; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: whitesmoke; padding: 5px 10px; border-radius: 4px;"&gt;k&lt;/span&gt;
      &lt;span style="background-color: whitesmoke; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: whitesmoke; padding: 5px 10px; border-radius: 4px;"&gt;l&lt;/span&gt;
      &lt;span style="background-color: whitesmoke; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: whitesmoke; padding: 5px 10px; border-radius: 4px;"&gt;m&lt;/span&gt;
      &lt;span style="background-color: whitesmoke; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: whitesmoke; padding: 5px 10px; border-radius: 4px;"&gt;n&lt;/span&gt;
      &lt;span style="background-color: whitesmoke; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: whitesmoke; padding: 5px 10px; border-radius: 4px;"&gt;o&lt;/span&gt;
      &lt;span style="background-color: whitesmoke; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: whitesmoke; padding: 5px 10px; border-radius: 4px;"&gt;p&lt;/span&gt;
      &lt;span style="background-color: whitesmoke; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: whitesmoke; padding: 5px 10px; border-radius: 4px;"&gt;q&lt;/span&gt;
      &lt;span style="background-color: whitesmoke; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: whitesmoke; padding: 5px 10px; border-radius: 4px;"&gt;r&lt;/span&gt;
      &lt;span style="background-color: whitesmoke; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: whitesmoke; padding: 5px 10px; border-radius: 4px;"&gt;s&lt;/span&gt;
      &lt;span style="background-color: whitesmoke; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: whitesmoke; padding: 5px 10px; border-radius: 4px;"&gt;t&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;u&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;v&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;w&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;x&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;y&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;z&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;b&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;b&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;b&lt;/span&gt;
      &lt;span style="background-color: whitesmoke; padding: 5px 10px; border-radius: 4px;"&gt;b&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;c&lt;/span&gt;
    &lt;/div&gt;
    &lt;p&gt;Accepted 22 – rejected 11.&lt;/p&gt;
  &lt;/div&gt;

&lt;!-- Draft with replacement error --&gt;
&lt;div&gt;
    &lt;h4&gt;Draft with replaced tokens&lt;/h4&gt;
    &lt;div style="display: flex; gap: 5px; margin-bottom: 10px; flex-wrap: wrap;"&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;b&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: #FBC6D4; padding: 5px 10px; border-radius: 4px;"&gt;X&lt;/span&gt;
      &lt;span style="background-color: #FBC6D4; padding: 5px 10px; border-radius: 4px;"&gt;X&lt;/span&gt;
      &lt;span style="background-color: #FBC6D4; padding: 5px 10px; border-radius: 4px;"&gt;X&lt;/span&gt;
      &lt;span style="background-color: #FBC6D4; padding: 5px 10px; border-radius: 4px;"&gt;X&lt;/span&gt;
      &lt;span style="background-color: #FBC6D4; padding: 5px 10px; border-radius: 4px;"&gt;e&lt;/span&gt;
      &lt;span style="background-color: #FBC6D4; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: #FBC6D4; padding: 5px 10px; border-radius: 4px;"&gt;f&lt;/span&gt;
      &lt;span style="background-color: #FBC6D4; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: #FBC6D4; padding: 5px 10px; border-radius: 4px;"&gt;g&lt;/span&gt;
      &lt;span style="background-color: #FBC6D4; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: #FBC6D4; padding: 5px 10px; border-radius: 4px;"&gt;h&lt;/span&gt;
      &lt;span style="background-color: whitesmoke; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: whitesmoke; padding: 5px 10px; border-radius: 4px;"&gt;i&lt;/span&gt;
      &lt;span style="background-color: whitesmoke; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: whitesmoke; padding: 5px 10px; border-radius: 4px;"&gt;j&lt;/span&gt;
      &lt;span style="background-color: whitesmoke; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: whitesmoke; padding: 5px 10px; border-radius: 4px;"&gt;k&lt;/span&gt;
      &lt;span style="background-color: whitesmoke; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: whitesmoke; padding: 5px 10px; border-radius: 4px;"&gt;l&lt;/span&gt;
      &lt;span style="background-color: whitesmoke; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: whitesmoke; padding: 5px 10px; border-radius: 4px;"&gt;m&lt;/span&gt;
      &lt;span style="background-color: whitesmoke; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: whitesmoke; padding: 5px 10px; border-radius: 4px;"&gt;n&lt;/span&gt;
      &lt;span style="background-color: whitesmoke; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: whitesmoke; padding: 5px 10px; border-radius: 4px;"&gt;o&lt;/span&gt;
      &lt;span style="background-color: whitesmoke; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: whitesmoke; padding: 5px 10px; border-radius: 4px;"&gt;p&lt;/span&gt;
      &lt;span style="background-color: whitesmoke; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: whitesmoke; padding: 5px 10px; border-radius: 4px;"&gt;q&lt;/span&gt;
      &lt;span style="background-color: whitesmoke; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: whitesmoke; padding: 5px 10px; border-radius: 4px;"&gt;r&lt;/span&gt;
      &lt;span style="background-color: whitesmoke; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: whitesmoke; padding: 5px 10px; border-radius: 4px;"&gt;s&lt;/span&gt;
      &lt;span style="background-color: whitesmoke; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: whitesmoke; padding: 5px 10px; border-radius: 4px;"&gt;t&lt;/span&gt;
      &lt;span style="background-color: whitesmoke; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;u&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;v&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;w&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;x&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;y&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;z&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;b&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;a&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;b&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;b&lt;/span&gt;
      &lt;span style="background-color: #BACBFF; padding: 5px 10px; border-radius: 4px;"&gt;b&lt;/span&gt;
      &lt;span style="background-color: whitesmoke; padding: 5px 10px; border-radius: 4px;"&gt;c&lt;/span&gt;
    &lt;/div&gt;
    &lt;p&gt;Accepted 21 – rejected 11.&lt;/p&gt;
  &lt;/div&gt;

&lt;p&gt;The examples above suggest that the number of tokens in the draft that need to be confirmed before the model can recover is approximately 32, which is twice the lookahead parameter.&lt;/p&gt;
&lt;p&gt;I'm referring to this parameter as the "recovery threshold," but if it's already known by a different name in the literature, I'd be interested to know.&lt;/p&gt;
&lt;p&gt;&lt;span style="display: inline-flex; align-items: center;"&gt;
  Code to reproduce this:&amp;nbsp;&amp;nbsp;
  &lt;a target="_blank"
     href="https://colab.research.google.com/drive/1p6Yudr1vXybk-7fqoZPPdGFCmeGd81f7#scrollTo=24MC_URVMxQ9&amp;line=1&amp;uniqifier=1"&gt;
    &lt;img src="https://colab.research.google.com/assets/colab-badge.svg"
         alt="Open in Colab"
         width="150"
         style="vertical-align: middle; margin: 0" /&gt;
  &lt;/a&gt;
&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;h3&gt;Bonus Observation: The "extra token" is not counted as accepted/rejected.&lt;/h3&gt;
&lt;p&gt;In the case of the Correct Draft above, 58 tokens were provided as a draft, but only 55 were accepted according to the API. The remaining three were neither accepted nor rejected, and I colored them in gray.&lt;/p&gt;
&lt;p&gt;This is because during a verification step, the model not only verifies the &lt;em&gt;K&lt;/em&gt; draft tokens, but also simultaneously predicts the (&lt;em&gt;K&lt;/em&gt;+1)-th token, given the previous &lt;em&gt;K&lt;/em&gt; tokens. This "extra token" comes for free, but only if all previous &lt;em&gt;K&lt;/em&gt; tokens were accepted.&lt;/p&gt;
&lt;p&gt;OpenAI appears to bill the "extra token" as a token generated by the model, rather than as a token from the draft. This is technically accurate, and since all tokens are billed at the same rate (as of this writing), it makes little practical difference.&lt;/p&gt;
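&lt;p&gt;To make this concrete, here is a toy sketch of a single greedy verification step as described in the speculative decoding literature, with a stand-in &lt;code&gt;greedy_next&lt;/code&gt; function instead of a real language model. It only illustrates the accept-until-mismatch logic and the free "extra token"; it is not meant to reproduce OpenAI's actual implementation.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Toy sketch of one greedy verification step (illustration only, not OpenAI's code).
# `greedy_next` stands in for the language model: given a list of tokens, it
# returns the most likely next token. In a real implementation, all K positions
# are scored in a single forward pass rather than in a Python loop.

def verify_window(prefix, draft_window, greedy_next):
    """Verify up to K draft tokens against the model's greedy predictions.

    Returns (accepted, next_token):
    - accepted: the prefix of the draft window that matches the model
    - next_token: the model's own token after the accepted prefix; if the
      whole window matched, this is the "extra token" that comes for free.
    """
    accepted = []
    for draft_token in draft_window:
        model_token = greedy_next(prefix + accepted)
        if model_token != draft_token:
            # First mismatch: the model's token replaces the draft token,
            # and the rest of the window is discarded.
            return accepted, model_token
        accepted.append(draft_token)
    # All tokens matched, so the (K+1)-th prediction is usable for free.
    return accepted, greedy_next(prefix + accepted)

# Example with a trivial "model" that always continues the alphabet:
alphabet = "abcdefghijklmnopqrstuvwxyz"
greedy_next = lambda tokens: alphabet[len(tokens) % 26]
print(verify_window([], list("abcXe"), greedy_next))  # (['a', 'b', 'c'], 'd')
&lt;/code&gt;&lt;/pre&gt;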
&lt;h2&gt;Hypothesized Algorithm&lt;/h2&gt;
&lt;p&gt;Based on this analysis, we can now (more or less) formulate the algorithm that OpenAI appears to be using:&lt;/p&gt;
&lt;p&gt;&lt;img alt="predicted-outputs-hypothesized-algorithm.png" src="../assets/openai-predicted-outputs/predicted-outputs-hypothesized-algorithm.png"&gt;&lt;/p&gt;
&lt;h2&gt;Evaluation&lt;/h2&gt;
&lt;p&gt;To check whether my implementation of the algorithm matches the API's behavior, I systematically tested scenarios where the language model needs to insert missing tokens into the draft. I varied both the position of the insertion and the number of tokens that needed to be inserted.&lt;/p&gt;
&lt;p&gt;As established earlier, I set the lookahead parameter to 16 tokens and the recovery threshold to 32 tokens. Like before, I kept the temperature at zero.&lt;/p&gt;
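&lt;p&gt;Each scenario boils down to a single API call with a prediction attached. The sketch below (using the current OpenAI Python SDK) shows the shape of such a call; the prompt and draft are placeholders, the real ones are in the linked notebook, and the usage fields are the ones documented for Predicted Outputs at the time of writing:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Sketch of a single measurement (requires the `openai` package and an API key).
from openai import OpenAI

client = OpenAI()

prompt = "List the lowercase alphabet as a comma-separated list."  # placeholder task
draft = "a, b, c, d, e, f, g, h, i, j, k, l, m"                    # placeholder draft

response = client.chat.completions.create(
    model="gpt-4o-mini",  # a model that supports Predicted Outputs
    messages=[{"role": "user", "content": prompt}],
    prediction={"type": "content", "content": draft},
    temperature=0,
)

details = response.usage.completion_tokens_details
print("accepted:", details.accepted_prediction_tokens)
print("rejected:", details.rejected_prediction_tokens)
&lt;/code&gt;&lt;/pre&gt;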
&lt;p&gt;When comparing the number of accepted and rejected tokens from my simulation with those returned by the OpenAI API, I found the results to be similar:&lt;/p&gt;
&lt;p&gt;&lt;img alt="predicted-outputs-simulation.png" src="../assets/openai-predicted-outputs/predicted-outputs-simulation.png" width="500px"&gt;&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;&lt;span style="display: inline-flex; align-items: center;"&gt;
  Code to reproduce this:&amp;nbsp;&amp;nbsp;
  &lt;a target="_blank"
     href="https://colab.research.google.com/drive/1p6Yudr1vXybk-7fqoZPPdGFCmeGd81f7#scrollTo=VooPbBtQ3tUV"&gt;
    &lt;img src="https://colab.research.google.com/assets/colab-badge.svg"
         alt="Open in Colab"
         width="150"
         style="vertical-align: middle; margin: 0" /&gt;
  &lt;/a&gt;
&lt;/span&gt;&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;OpenAI's Predicted Outputs feature may lack documentation, but the algorithm behind it can be largely reverse-engineered by analyzing the number of accepted and rejected tokens for a given input.&lt;/p&gt;
&lt;p&gt;A limitation of this analysis is that it only covers greedy decoding; generalizing it to temperatures greater than zero would be interesting.
Another open question is why the API responses are so stochastic. However, this is mostly an academic question, since the stochasticity of the API is unlikely to be a deal-breaker for most practical applications.&lt;/p&gt;
&lt;h2&gt;Further Resources&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://platform.openai.com/docs/guides/predicted-outputs"&gt;Documentation by OpenAI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/horseee/Awesome-Efficient-LLM/blob/main/inference_acceleration.md"&gt;Awesome-Efficient-LLM – A curated paper list on inference acceleration&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;div id="bibliography" class="appendix"&gt;&lt;h2&gt;References&lt;/h2&gt;&lt;p id='chen2023acceleratinglargelanguagemodel'&gt;Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, and John Jumper.
Accelerating large language model decoding with speculative sampling.
2023.
URL: &lt;a href="https://arxiv.org/abs/2302.01318"&gt;https://arxiv.org/abs/2302.01318&lt;/a&gt;, &lt;a href="https://arxiv.org/abs/2302.01318"&gt;arXiv:2302.01318&lt;/a&gt;. &lt;a class="cite-backref" href="#ref-chen2023acceleratinglargelanguagemodel-1" title="Jump back to reference 1"&gt;↩&lt;/a&gt;&lt;/p&gt;
&lt;p id='pmlr-v202-leviathan23a'&gt;Yaniv Leviathan, Matan Kalman, and Yossi Matias.
Fast inference from transformers via speculative decoding.
In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, &lt;em&gt;Proceedings of the 40th International Conference on Machine Learning&lt;/em&gt;, volume 202 of Proceedings of Machine Learning Research, 19274–19286. PMLR, 23–29 Jul 2023.
URL: &lt;a href="https://proceedings.mlr.press/v202/leviathan23a.html"&gt;https://proceedings.mlr.press/v202/leviathan23a.html&lt;/a&gt;. &lt;a class="cite-backref" href="#ref-pmlr-v202-leviathan23a-1" title="Jump back to reference 1"&gt;↩&lt;/a&gt;&lt;/p&gt;
&lt;p id='sun-etal-2021-instantaneous'&gt;Xin Sun, Tao Ge, Furu Wei, and Houfeng Wang.
Instantaneous grammatical error correction with shallow aggressive decoding.
In Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli, editors, &lt;em&gt;Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)&lt;/em&gt;, 5937–5947. Online, August 2021. Association for Computational Linguistics.
URL: &lt;a href="https://aclanthology.org/2021.acl-long.462/"&gt;https://aclanthology.org/2021.acl-long.462/&lt;/a&gt;, &lt;a href="https://doi.org/10.18653/v1/2021.acl-long.462"&gt;doi:10.18653/v1/2021.acl-long.462&lt;/a&gt;. &lt;a class="cite-backref" href="#ref-sun-etal-2021-instantaneous-1" title="Jump back to reference 1"&gt;↩&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;</content><category term="blog"></category></entry><entry><title>An Encoder Model for Swiss German</title><link href="https://vamvas.ch/swiss-german-encoder" rel="alternate"></link><published>2024-01-23T00:00:00+01:00</published><updated>2024-01-23T00:00:00+01:00</updated><author><name>Jannis Vamvas</name></author><id>tag:vamvas.ch,2024-01-23:/swiss-german-encoder</id><summary type="html">&lt;p&gt;SwissBERT can now process written Swiss German.&lt;/p&gt;</summary><content type="html">&lt;p&gt;Last year, &lt;a href="/introducing-swissbert"&gt;I announced SwissBERT&lt;/a&gt;, a multilingual encoder model that we trained on news articles from Switzerland.&lt;/p&gt;
&lt;p&gt;At the time, we found that SwissBERT had good accuracy on Switzerland-related NLP tasks such as named entity recognition and stance detection, compared to similar models that were not trained on those data. We were especially impressed by the model's performance on Romansh input text, for which we had little training data and for which no previous language model existed.&lt;a href="#fn1"&gt;*&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;A limitation of SwissBERT is that it has only been trained on news articles. In Switzerland, there is a stark difference between Standard German, as used in newspapers, and Swiss German dialect. Dialect is not traditionally written, but has become ubiquitous on social media, in text messages and other informal contexts:&lt;/p&gt;
&lt;blockquote class="twitter-tweet"&gt;&lt;p lang="de" dir="ltr"&gt;Super happy für min &lt;a href="https://twitter.com/FC_Basel?ref_src=twsrc%5Etfw"&gt;&amp;#64;FC_Basel&lt;/a&gt; 3-0 super Resultat! Das git mir so richtig lust uf min match morn im halbfinal!!! &lt;a href="https://twitter.com/hashtag/Dangge?src=hash&amp;amp;ref_src=twsrc%5Etfw"&gt;#Dangge&lt;/a&gt;&lt;/p&gt;&amp;mdash; Roger Federer (&amp;#64;rogerfederer) &lt;a href="https://twitter.com/rogerfederer/status/439136373527019520?ref_src=twsrc%5Etfw"&gt;February 27, 2014&lt;/a&gt;&lt;/blockquote&gt;
&lt;script async src="https://platform.twitter.com/widgets.js" charset="utf-8"&gt;&lt;/script&gt;

&lt;p&gt;An ideal encoder model for Switzerland should thus be able to process written text in both Standard German and Swiss German (in addition to Romansh, French, and Italian).&lt;/p&gt;
&lt;h2&gt;Adding Swiss German to SwissBERT&lt;/h2&gt;
&lt;p&gt;In a &lt;a href="https://arxiv.org/abs/2401.14400"&gt;new paper that we present at the MOOMIN Workshop on Modular and Open Multilingual NLP&lt;/a&gt;, we propose an updated version of SwissBERT that can do just that. We trained the model on two new datasets:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href="https://icosys.ch/swisscrawl"&gt;SwissCrawl&lt;/a&gt;&amp;nbsp;&lt;a class='citation-link' data-toggle='tooltip' data-html='true' title='Lucy Linder, Michael Jungo, Jean Hennebert, Claudiu&amp;nbsp;Cristian Musat, and Andreas Fischer.
Automatic creation of text corpora for low-resource languages from the internet: the case of swiss german.
In &lt;em&gt;Proceedings of The 12th Language Resources and Evaluation Conference&lt;/em&gt;, 2706–2711. Marseille, France, May 2020. European Language Resources Association.
URL: &lt;a href="https://www.aclweb.org/anthology/2020.lrec-1.329"&gt;https://www.aclweb.org/anthology/2020.lrec-1.329&lt;/a&gt;.' href='#linder2020crawler' id='ref-linder2020crawler-1'&gt;(Linder et al., 2020)&lt;/a&gt;, a collection of Swiss German web text (forum discussions, social media).&lt;/li&gt;
&lt;li&gt;A dataset of Swiss German tweets that I collected during my master studies at LMU Munich.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;We evaluated the model on three Swiss German tasks and found that adding Swiss German to the training data generally leads to a clear improvement in accuracy:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Accuracy of SwissBERT on three Swiss German NLP tasks" src="https://vamvas.ch/assets/swiss-german-encoder/swiss-german-results.png" width="450px"&gt;&lt;/p&gt;
&lt;p&gt;The updated model is available on the &lt;a href="https://huggingface.co/ZurichNLP/swissbert"&gt;Hugging Face hub&lt;/a&gt;. Note that due to the data licenses, use of the model is restricted to research purposes.&lt;/p&gt;
&lt;h2&gt;Modular Adaptation through Adapters&lt;/h2&gt;
&lt;p&gt;&lt;a href="/introducing-swissbert"&gt;In an earlier post&lt;/a&gt;, I described the modular architecture that we used for SwissBERT, called X-MOD&amp;nbsp;&lt;a class='citation-link' data-toggle='tooltip' data-html='true' title='Jonas Pfeiffer, Naman Goyal, Xi&amp;nbsp;Lin, Xian Li, James Cross, Sebastian Riedel, and Mikel Artetxe.
Lifting the curse of multilinguality by pre-training modular transformers.
In &lt;em&gt;Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies&lt;/em&gt;, 3479–3495. Seattle, United States, July 2022. Association for Computational Linguistics.
URL: &lt;a href="https://aclanthology.org/2022.naacl-main.255"&gt;https://aclanthology.org/2022.naacl-main.255&lt;/a&gt;, &lt;a href="https://doi.org/10.18653/v1/2022.naacl-main.255"&gt;doi:10.18653/v1/2022.naacl-main.255&lt;/a&gt;.' href='#pfeiffer-etal-2022-lifting' id='ref-pfeiffer-etal-2022-lifting-1'&gt;(Pfeiffer et al., 2022)&lt;/a&gt;. The idea is to have a single encoder model that can be adapted to different languages by adding a language adapter module for each language. The adapter is activated only when the model is processing input in the given language.&lt;/p&gt;
&lt;p&gt;&lt;img alt="The original SwissBERT model has four language adapters for the four national languages of Switzerland." src="https://vamvas.ch/assets/swiss-german-encoder/original-swissbert-adaptation.png" width="550px"&gt;&lt;/p&gt;
&lt;p&gt;The original SwissBERT model has four language adapters for the four national languages of Switzerland (&lt;em&gt;de_CH&lt;/em&gt; = Swiss Standard German, &lt;em&gt;fr_CH&lt;/em&gt; = French, &lt;em&gt;it_CH&lt;/em&gt; = Italian, &lt;em&gt;rm_CH&lt;/em&gt; = Romansh Grischun).
Adding Swiss German to the model is straightforward: We simply add a fifth adapter for Swiss German (&lt;em&gt;gsw&lt;/em&gt;). All the other modules stay exactly the same:
&lt;img alt="The new version of SwissBERT contains a fifth adapter, representing Swiss German." src="https://vamvas.ch/assets/swiss-german-encoder/swissbert-swiss-german-adaptation.png" width="315px"&gt;&lt;/p&gt;
&lt;p&gt;We compared this strategy to a baseline where we update the Transformer layers and embeddings as well, when we train on Swiss German. We found that our modular approach reaches 97.5% of the accuracy of the baseline, while the multilinguality of the model is guaranteed to be preserved.&lt;/p&gt;
&lt;h2&gt;More Findings 🤓&lt;/h2&gt;
&lt;p&gt;In the &lt;a href="https://arxiv.org/abs/2401.14400"&gt;paper&lt;/a&gt;, we perform some more experiments – beyond SwissBERT – which are especially relevant if you're planning to train your own model on Swiss German. Here's the highlights:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;XLM-R is a surprisingly good baseline.&lt;/strong&gt; Obviously, the multilingual model &lt;a href="https://huggingface.co/xlm-roberta-base"&gt;XLM-R&lt;/a&gt;&amp;nbsp;&lt;a class='citation-link' data-toggle='tooltip' data-html='true' title='Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzm&lt;span class="bibtex-protected"&gt;á&lt;/span&gt;n, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov.
Unsupervised cross-lingual representation learning at scale.
In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault, editors, &lt;em&gt;Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics&lt;/em&gt;, 8440–8451. Online, July 2020. Association for Computational Linguistics.
URL: &lt;a href="https://aclanthology.org/2020.acl-main.747"&gt;https://aclanthology.org/2020.acl-main.747&lt;/a&gt;, &lt;a href="https://doi.org/10.18653/v1/2020.acl-main.747"&gt;doi:10.18653/v1/2020.acl-main.747&lt;/a&gt;.' href='#conneau-etal-2020-unsupervised' id='ref-conneau-etal-2020-unsupervised-1'&gt;(Conneau et al., 2020)&lt;/a&gt; does not work well with Swiss German, because it was never trained on text in this language. However, if we just continue the pre-training of XLM-R on our Swiss German dataset, the average accuracy is actually higher than that of the adapted SwissBERT. &lt;a href="https://huggingface.co/ZurichNLP/swiss-german-xlm-roberta-base"&gt;We release our Swiss German XLM-R model here.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Character-level modeling is good for cross-lingual retrieval.&lt;/strong&gt; The spelling of Swiss German text is highly variable, which motivated us to also try adapting a multilingual character-level model. We adapted &lt;a href="https://huggingface.co/google/canine-s"&gt;CANINE&lt;/a&gt;&amp;nbsp;&lt;a class='citation-link' data-toggle='tooltip' data-html='true' title='Jonathan&amp;nbsp;H. Clark, Dan Garrette, Iulia Turc, and John Wieting.
&lt;span class="bibtex-protected"&gt;Canine: Pre-training an Efficient Tokenization-Free Encoder for Language Representation&lt;/span&gt;.
&lt;em&gt;Transactions of the Association for Computational Linguistics&lt;/em&gt;, 10:73&amp;ndash;91, 01 2022.
URL: &lt;a href="https://doi.org/10.1162/tacl\_a\_00448"&gt;https://doi.org/10.1162/tacl\_a\_00448&lt;/a&gt;, &lt;a href="https://arxiv.org/abs/https://direct.mit.edu/tacl/article-pdf/doi/10.1162/tacl\_a\_00448/1985933/tacl\_a\_00448.pdf"&gt;arXiv:https://direct.mit.edu/tacl/article-pdf/doi/10.1162/tacl\_a\_00448/1985933/tacl\_a\_00448.pdf&lt;/a&gt;, &lt;a href="https://doi.org/10.1162/tacl_a_00448"&gt;doi:10.1162/tacl_a_00448&lt;/a&gt;.' href='#clark-etal-2022-canine' id='ref-clark-etal-2022-canine-1'&gt;(Clark et al., 2022)&lt;/a&gt; to Swiss German and found that on part-of-speech tagging, it achieves a much lower accuracy than the subword-based XLM-R and SwissBERT models. But CANINE achieves the best accuracy on the sentence retrieval task, which to us was surprising and could be an inspiration for future work. &lt;a href="https://huggingface.co/ZurichNLP/swiss-german-canine"&gt;The Swiss German CANINE model is available here.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;A custom Swiss German subword vocabulary is not beneficial.&lt;/strong&gt; Given the spelling differences between Standard German and Swiss German, one might think that a custom subword vocabulary is needed for Swiss German. However, we found that re-using the existing vocabularies of XLM-R and SwissBERT worked better. In addition to the spelling variation in Switzerland, which might make compression harder, another likely reason is that the Swiss German dataset is relatively small, so the model might not be able to learn good word embeddings from scratch.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Equipped with a Swiss German adapter, the &lt;a href="https://huggingface.co/ZurichNLP/swissbert"&gt;SwissBERT&lt;/a&gt; model is now a more complete text encoder and covers not only the four national languages of Switzerland, but also a family of dialects that is spoken (and written) by 5 million people in Switzerland.&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;/p&gt;

&lt;p id="fn1"&gt;*You might ask: Couldn't ChatGPT do these tasks as well? Yes, maybe it could. But an encoder like SwissBERT is a lot smaller, and, as of today, much faster. One use case where efficiency is important is processing a large document collection, e.g., for creating an embedding-based search index. Such an index is often needed for enabling retrieval augmentation of large language models.&lt;/p&gt;

&lt;p&gt;&lt;br&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The work presented in this post is joint work with &lt;a href="https://noe-eva.github.io/"&gt;Noëmi Aepli&lt;/a&gt; and &lt;a href="https://www.cl.uzh.ch/de/about-us/people/team/compling/sennrich.html"&gt;Rico Sennrich&lt;/a&gt;. It was funded by the Swiss National Science Foundation (project nos. &lt;a href="https://data.snf.ch/grants/grant/213976"&gt;213976&lt;/a&gt; and &lt;a href="https://data.snf.ch/grants/grant/191934"&gt;191934&lt;/a&gt;).&lt;/em&gt;&lt;/p&gt;&lt;div id="bibliography" class="appendix"&gt;&lt;h2&gt;References&lt;/h2&gt;&lt;p id='clark-etal-2022-canine'&gt;Jonathan&amp;nbsp;H. Clark, Dan Garrette, Iulia Turc, and John Wieting.
&lt;span class="bibtex-protected"&gt;Canine: Pre-training an Efficient Tokenization-Free Encoder for Language Representation&lt;/span&gt;.
&lt;em&gt;Transactions of the Association for Computational Linguistics&lt;/em&gt;, 10:73&amp;ndash;91, 01 2022.
URL: &lt;a href="https://doi.org/10.1162/tacl\_a\_00448"&gt;https://doi.org/10.1162/tacl\_a\_00448&lt;/a&gt;, &lt;a href="https://arxiv.org/abs/https://direct.mit.edu/tacl/article-pdf/doi/10.1162/tacl\_a\_00448/1985933/tacl\_a\_00448.pdf"&gt;arXiv:https://direct.mit.edu/tacl/article-pdf/doi/10.1162/tacl\_a\_00448/1985933/tacl\_a\_00448.pdf&lt;/a&gt;, &lt;a href="https://doi.org/10.1162/tacl_a_00448"&gt;doi:10.1162/tacl_a_00448&lt;/a&gt;. &lt;a class="cite-backref" href="#ref-clark-etal-2022-canine-1" title="Jump back to reference 1"&gt;↩&lt;/a&gt;&lt;/p&gt;
&lt;p id='conneau-etal-2020-unsupervised'&gt;Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzm&lt;span class="bibtex-protected"&gt;á&lt;/span&gt;n, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov.
Unsupervised cross-lingual representation learning at scale.
In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault, editors, &lt;em&gt;Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics&lt;/em&gt;, 8440–8451. Online, July 2020. Association for Computational Linguistics.
URL: &lt;a href="https://aclanthology.org/2020.acl-main.747"&gt;https://aclanthology.org/2020.acl-main.747&lt;/a&gt;, &lt;a href="https://doi.org/10.18653/v1/2020.acl-main.747"&gt;doi:10.18653/v1/2020.acl-main.747&lt;/a&gt;. &lt;a class="cite-backref" href="#ref-conneau-etal-2020-unsupervised-1" title="Jump back to reference 1"&gt;↩&lt;/a&gt;&lt;/p&gt;
&lt;p id='linder2020crawler'&gt;Lucy Linder, Michael Jungo, Jean Hennebert, Claudiu&amp;nbsp;Cristian Musat, and Andreas Fischer.
Automatic creation of text corpora for low-resource languages from the internet: the case of swiss german.
In &lt;em&gt;Proceedings of The 12th Language Resources and Evaluation Conference&lt;/em&gt;, 2706–2711. Marseille, France, May 2020. European Language Resources Association.
URL: &lt;a href="https://www.aclweb.org/anthology/2020.lrec-1.329"&gt;https://www.aclweb.org/anthology/2020.lrec-1.329&lt;/a&gt;. &lt;a class="cite-backref" href="#ref-linder2020crawler-1" title="Jump back to reference 1"&gt;↩&lt;/a&gt;&lt;/p&gt;
&lt;p id='pfeiffer-etal-2022-lifting'&gt;Jonas Pfeiffer, Naman Goyal, Xi&amp;nbsp;Lin, Xian Li, James Cross, Sebastian Riedel, and Mikel Artetxe.
Lifting the curse of multilinguality by pre-training modular transformers.
In &lt;em&gt;Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies&lt;/em&gt;, 3479–3495. Seattle, United States, July 2022. Association for Computational Linguistics.
URL: &lt;a href="https://aclanthology.org/2022.naacl-main.255"&gt;https://aclanthology.org/2022.naacl-main.255&lt;/a&gt;, &lt;a href="https://doi.org/10.18653/v1/2022.naacl-main.255"&gt;doi:10.18653/v1/2022.naacl-main.255&lt;/a&gt;. &lt;a class="cite-backref" href="#ref-pfeiffer-etal-2022-lifting-1" title="Jump back to reference 1"&gt;↩&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;</content><category term="blog"></category></entry><entry><title>When ChatGPT Fills Out the Smartvote Questionnaire</title><link href="https://vamvas.ch/chatgpt-smartvote" rel="alternate"></link><published>2023-08-23T00:00:00+02:00</published><updated>2023-08-23T00:00:00+02:00</updated><author><name>Jannis Vamvas</name></author><id>tag:vamvas.ch,2023-08-23:/chatgpt-smartvote</id><summary type="html">&lt;p&gt;Are language models politically biased?&lt;/p&gt;</summary><content type="html">&lt;p&gt;As language models are increasingly used in everyday life, the question of whether they have a political slant is becoming relevant. Earlier this week, the &lt;a href="https://www.tagesanzeiger.ch/gesinnung-von-kuenstlicher-intelligenz-sollen-frauen-karriere-machen-chatbots-sind-sich-bei-dieser-frage-uneinig-907101722614"&gt;Tamedia newspapers reported on a study that investigated exactly that&lt;/a&gt;. The study used an English-language questionnaire (the &lt;em&gt;Political Compass&lt;/em&gt;) to test language models for their political orientation.&lt;/p&gt;
&lt;p&gt;Ein &lt;a href="https://aclanthology.org/2023.acl-long.656/"&gt;Ergebnis&lt;/a&gt; war, dass ChatGPT tendenziell links der Mitte zu verorten ist. Allerdings ist der verwendete Fragebogen – wie vieles in der Politik – nicht ganz unumstritten. In der Schweiz gibt es mit &lt;a href="https://smartvote.ch/de/home"&gt;Smartvote&lt;/a&gt; seit 20 Jahren einen Fragebogen, der sich bewährt hat, und heute wurde eine neue Version für die eidgenössischen Wahlen im Oktober 2023 aufgeschaltet. Jetzt bietet es sich an, ChatGPT auch diesen Fragebogen ausfüllen zu lassen.&lt;/p&gt;
&lt;p&gt;Der neue Fragebogen vom Smartvote enthält 75 Fragen, von Umwelt und Energie über gesellschaftliche Fragen bis zum Bundeshaushalt. Smartvote erstellt anhand des Fragebogens eine Grafik mit grossem Wiedererkennungswert, genannt &lt;em&gt;Smartspider&lt;/em&gt;. Eine weitere Besonderheit von Smartvote ist die Mehrsprachigkeit. Diese hat mich vor drei Jahren schon zu einem anderen &lt;a href="/more-general-stance-detection-with-x-stance"&gt;computerlinguistischen Experiment&lt;/a&gt; inspiriert.&lt;/p&gt;
&lt;p&gt;Um ChatGPT den Smartvote-Fragebogen beantworten zu lassen, muss man kein Experte sein. Die Idee ist auch &lt;a href="https://twitter.com/lstuber/status/1600776574470860801"&gt;nicht&lt;/a&gt; &lt;a href="https://twitter.com/astulz/status/1627822920390127616"&gt;komplett&lt;/a&gt; neu. In diesem Blogpost möchte ich die Idee aber etwas genauer anschauen:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;I compare ChatGPT with the currently largest open-source language model (LLaMA 2), which is less easily accessible than ChatGPT.&lt;/li&gt;
&lt;li&gt;I ask the questions in all four national languages, plus English. Interestingly, the language influences the result, which could have several reasons.&lt;/li&gt;
&lt;li&gt;I explain why generating answers multiple times, or analyzing word probabilities, are important methods for capturing the tendencies of a language model.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;ChatGPT vs. LLaMA 2&lt;/h2&gt;
&lt;p&gt;ChatGPT is a commercial language model by OpenAI that registered users can &lt;a href="https://chat.openai.com/"&gt;use free of charge&lt;/a&gt;. Unfortunately, the technical details of ChatGPT are not publicly available. This is why it is useful to also have open-source models that can be examined more closely.&lt;/p&gt;
&lt;p&gt;One such open-source model is &lt;a href="https://ai.meta.com/llama/"&gt;LLaMa 2&lt;/a&gt; by Meta. I present the Smartvote questionnaire to both ChatGPT and LLaMa 2, specifically to the &lt;a href="https://huggingface.co/meta-llama/Llama-2-70b-chat-hf"&gt;largest version of LLaMa with 70&amp;nbsp;billion parameters&lt;/a&gt;, which has been optimized for chat applications.&lt;/p&gt;
&lt;p&gt;A first comparison shows that both models receive a similar Smartspider chart:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Smartspider for ChatGPT and LLaMa 2" src="https://vamvas.ch/assets/chatgpt-smartvote/smartvote-chatgpt-vs-llama-2.png"&gt;
&lt;br&gt;&lt;/p&gt;
&lt;p&gt;Both spiders are strongly centered. However, this does not necessarily mean that the models agree in their answers:
they give the same answer to only 38 of the 75 questions.
Rather, the models tend to give moderate answers ("rather yes", "rather no"):
&lt;img alt="Bar chart with the distribution of the answers of ChatGPT and LLaMa 2" src="https://vamvas.ch/assets/chatgpt-smartvote/verteilung-antworten.png"&gt;&lt;/p&gt;
&lt;p&gt;However, a distinctive Smartspider is only obtained by giving decided answers ("yes", "no") every now and then.&lt;/p&gt;
&lt;h2&gt;Does the Language Matter?&lt;/h2&gt;
&lt;p&gt;So far, I have asked the questions only in German. Do the Smartspiders look the same if the questions are asked in French, Italian or Romansh, or in English?
For these languages, Smartvote provides a translation of the questions and answer options.
Presenting these to the language models shows that, in part, quite different Smartspiders come out:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Smartspiders for ChatGPT and LLaMa 2 compared across different languages" src="https://vamvas.ch/assets/chatgpt-smartvote/smartvote-chatgpt-vs-llama-2-multilingual.png" width="55%"&gt;&lt;/p&gt;
&lt;p&gt;The first thought is surely that the language, and with it the cultural region, influences the answers of the language model. In Switzerland, people often speak of the "Röstigraben" and "Polentagraben", the divides that also separate the language regions politically. Concretely, I could imagine that the Italian-language texts on which ChatGPT was trained propagate, on average, slightly different opinions than, say, texts in German.&lt;/p&gt;
&lt;p&gt;However, I can also imagine a more mundane reason: chance. Perhaps the language models are influenced by superficial features, for example the length of the questions, the order of the answer options, and the wording. Such a lack of &lt;em&gt;robustness&lt;/em&gt; is frequently observed in computational linguistics.&lt;/p&gt;
&lt;p&gt;To check this, one could rephrase the questions and see whether the model then outputs a different opinion. That would be an interesting experiment for the future. If a lack of robustness were confirmed, it would in my view no longer be correct to speak of the "opinion" or the "political orientation" of a language model.&lt;/p&gt;
&lt;p&gt;Instead, ChatGPT's Smartspider would simply be a product of chance, the result of a process that has nothing to do with politics at all.
Language models can still exhibit a bias.
It just could not be quantified with a political questionnaire.&lt;/p&gt;
&lt;h2&gt;On the Methodology&lt;/h2&gt;
&lt;p&gt;In principle, you can simply type a question into the chat window and try to map the answer onto the scheme "yes", "rather yes", "rather no", "no":&lt;/p&gt;
&lt;div style="display: inline-block; box-shadow: 0px 4px 8px rgba(0, 0, 0, 0.2);"&gt;
    &lt;img style="margin:0" src="https://vamvas.ch/assets/chatgpt-smartvote/chatgpt-prompting-example.png" alt="Screenshot von ChatGPT mit Frage und (ausweichender) Antwort" /&gt;
&lt;/div&gt;

&lt;p&gt;&lt;br&gt;&lt;/p&gt;
&lt;p&gt;However, this approach has two drawbacks:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;The language model often refuses to give a decided answer; after all, its maker trained it to stay on the safe side with sensitive questions.&lt;/li&gt;
&lt;li&gt;The answer we get is not necessarily the single most probable answer for the model. Instead, the answer is produced by &lt;a href="https://huggingface.co/docs/transformers/generation_strategies#multinomial-sampling"&gt;sampling&lt;/a&gt;, which involves a bit of randomness. If you ask the question again, you might get the opposite answer.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The method I chose for this blog post is therefore somewhat more involved. I instruct the language model to answer with either A, B, C or D. I tell the model that A means "yes", B means "rather yes", and so on. Then I compare the probabilities the model assigns to these words: is "A", "B", "C" or "D" the most likely next word? This cannot be done via the chat window; it requires program code, for example the &lt;a href="https://lmql.ai/"&gt;LMQL&lt;/a&gt; project from ETH Zurich.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Example of a probability distribution over the first word of the answer" src="https://vamvas.ch/assets/chatgpt-smartvote/antwort-wahrscheinlichkeit.png" width="70%"&gt;&lt;/p&gt;
&lt;p&gt;One could also put it this way: I force the language model to generate one of the given answers. Answers other than A, B, C or D are not allowed (&lt;em&gt;forced choice&lt;/em&gt;).&lt;/p&gt;
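&lt;p&gt;For illustration, here is a minimal sketch of this forced choice with an open model via Hugging Face &lt;em&gt;transformers&lt;/em&gt;; the model and the question below are placeholders, not the actual setup of this post:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Sketch of the forced choice: compare the probabilities that the model assigns
# to "A", "B", "C" and "D" as the next word. Model and question are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small stand-in; the post used LLaMa 2 70B Chat
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

question = "Should the retirement age be raised?"  # placeholder question
prompt = (
    f"{question}\n"
    "Reply with one letter: A = yes, B = rather yes, C = rather no, D = no.\n"
    "Reply:"
)

with torch.no_grad():
    logits = model(**tokenizer(prompt, return_tensors="pt")).logits[0, -1]
probs = torch.softmax(logits, dim=-1)

options = {}
for letter in "ABCD":
    token_id = tokenizer.encode(" " + letter, add_special_tokens=False)[0]
    options[letter] = probs[token_id].item()

total = sum(options.values())
print({letter: round(p / total, 3) for letter, p in options.items()})
&lt;/code&gt;&lt;/pre&gt;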
&lt;p&gt;In the case of ChatGPT, it is not yet possible to analyze the probabilities of the individual words; the OpenAI API is currently not suited for this.
An equivalent approach is to have an answer regenerated many times. Counting how often "A", "B", "C" and "D" come out then allows estimating the underlying probabilities. This is what I did for ChatGPT.&lt;/p&gt;
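&lt;p&gt;A sketch of this counting approach, using today's OpenAI Python client for illustration (the client looked slightly different at the time of the experiment, and model and question are again placeholders):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Sketch: estimate the answer distribution by regenerating the answer many times
# and counting the letters. Model and question are placeholders.
from collections import Counter
from openai import OpenAI

client = OpenAI()  # requires OPENAI_API_KEY

question = "Should the retirement age be raised?"  # placeholder question
prompt = (
    f"{question}\n"
    "Reply with exactly one letter: A = yes, B = rather yes, C = rather no, D = no."
)

counts = Counter()
for _ in range(50):
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=1,
    )
    letter = (response.choices[0].message.content or "").strip()[:1]
    if letter in "ABCD":
        counts[letter] += 1

total = sum(counts.values()) or 1
print({letter: counts[letter] / total for letter in "ABCD"})
&lt;/code&gt;&lt;/pre&gt;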
&lt;h2&gt;FAQ&lt;/h2&gt;
&lt;h3&gt;What exactly are the questions?&lt;/h3&gt;
&lt;p&gt;The questionnaire is available on the &lt;a href="https://smartvote.ch/de/home"&gt;Smartvote website&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;How is the Smartspider calculated?&lt;/h3&gt;
&lt;p&gt;Smartvote defines the Smartspider by assigning a subset of the questions to one or more axes. The exact assignment is documented &lt;a href="https://web.archive.org/web/20230823191600/https://sv22-prod.s3.eu-central-1.amazonaws.com/bh7ildzp3qxh1ezg2zwij6bxpl8w?response-content-disposition=inline%3B%20filename%3D%22Cleavage%20Dokument%20Fragebogen%20NR_SR_de_fr_it.pdf%22%3B%20filename%2A%3DUTF-8%27%27Cleavage%2520Dokument%2520Fragebogen%2520NR_SR_de_fr_it.pdf&amp;amp;response-content-type=application%2Fpdf&amp;amp;X-Amz-Algorithm=AWS4-HMAC-SHA256&amp;amp;X-Amz-Credential=AKIAQ5KGVQEK3IRSRU7W%2F20230823%2Feu-central-1%2Fs3%2Faws4_request&amp;amp;X-Amz-Date=20230823T191217Z&amp;amp;X-Amz-Expires=3600&amp;amp;X-Amz-SignedHeaders=host&amp;amp;X-Amz-Signature=4f4e9700a6670b258a5018faa9c3f67603d6f77f94e39859b8fd01029066403a"&gt;in a PDF&lt;/a&gt;.&lt;/p&gt;</content><category term="blog"></category></entry><entry><title>My PhD Thesis Is Out! (A Summary)</title><link href="https://vamvas.ch/phd-thesis" rel="alternate"></link><published>2023-04-05T00:00:00+02:00</published><updated>2023-04-05T00:00:00+02:00</updated><author><name>Jannis Vamvas</name></author><id>tag:vamvas.ch,2023-04-05:/phd-thesis</id><summary type="html"></summary><content type="html">&lt;p&gt;After quite a few months of writing and polishing, &lt;a href="https://www.zora.uzh.ch/id/eprint/232796/"&gt;my PhD thesis is now available on the e-print repository of my university&lt;/a&gt;.
And a few weeks ago, I successfully defended the thesis in an oral examination:&lt;/p&gt;
&lt;blockquote class="twitter-tweet"&gt;&lt;p lang="en" dir="ltr"&gt;Congratulations to &lt;a href="https://twitter.com/j_vamvas?ref_src=twsrc%5Etfw"&gt;&amp;#64;j_vamvas&lt;/a&gt; , who just passed his viva on &amp;quot;Model-based Evaluation of Multilinguality&amp;quot;! With thanks to the examiners &lt;a href="https://twitter.com/unattributed?ref_src=twsrc%5Etfw"&gt;&amp;#64;unattributed&lt;/a&gt; and &lt;a href="https://twitter.com/LenaAJaeger?ref_src=twsrc%5Etfw"&gt;&amp;#64;LenaAJaeger&lt;/a&gt; . &lt;a href="https://t.co/vhwp8CtTWy"&gt;pic.twitter.com/vhwp8CtTWy&lt;/a&gt;&lt;/p&gt;&amp;mdash; Zurich Computational Linguistics Group (&amp;#64;cl_uzh) &lt;a href="https://twitter.com/cl_uzh/status/1635281993683603456?ref_src=twsrc%5Etfw"&gt;March 13, 2023&lt;/a&gt;&lt;/blockquote&gt;
&lt;script async src="https://platform.twitter.com/widgets.js" charset="utf-8"&gt;&lt;/script&gt;

&lt;p&gt;My thesis combines four research papers published in 2021 and 2022.
All the papers were co-authored with my main supervisor, Rico Sennrich.
I have previously summarized the papers on this blog:
    &lt;a href="https://vamvas.ch/the-limits-of-minimal-sentence-pairs"&gt;1&lt;/a&gt;,
    &lt;a href="https://vamvas.ch/evaluating-black-box-mt-with-contrastive-conditioning"&gt;2a&lt;/a&gt; and
    &lt;a href="https://vamvas.ch/when-mt-distillation-leads-to-bias"&gt;2b&lt;/a&gt;,
    &lt;a href="https://vamvas.ch/lost-and-found-in-translation"&gt;3&lt;/a&gt;, and
    &lt;a href="https://vamvas.ch/nmtscore-text-similarity-via-translation"&gt;4&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;What I added to the thesis was a 40-page introduction with some additional context.
The introduction characterizes the key problem: how to evaluate modern natural language processing&amp;nbsp;(NLP) systems in multiple languages.
Many of these systems – like GPT, DeepL or Google Translate – are designed to handle multilingual input.
Evaluation means figuring out how well the systems do in comparison to each other and to humans.&lt;/p&gt;
&lt;p&gt;Good evaluation practice is necessary for multiple reasons: for research and development&amp;nbsp;(doing experiments and measuring the effect of changes), for real-world applications&amp;nbsp;(ensuring safety) and for society at large&amp;nbsp;(understanding when NLP systems are working well and when they are failing).
But multilinguality remains a great challenge, mostly because there are so many languages in the world, but also because resources, including human resources, are not equally available for all languages.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Motivations and challenges of evaluation in NLP" src="https://vamvas.ch/assets/thesis/thesis-motivation-challenges.png" width="500px"&gt;&lt;/p&gt;
&lt;p&gt;Most contributions in the thesis focus on &lt;em&gt;targeted&lt;/em&gt; evaluation, where a specific aspect of system quality is examined.
There is a lot of previous work on targeted evaluation; for example, the WinoMT challenge&amp;nbsp;&lt;a class='citation-link' data-toggle='tooltip' data-html='true' title='Gabriel Stanovsky, Noah&amp;nbsp;A. Smith, and Luke Zettlemoyer.
Evaluating gender bias in machine translation.
In &lt;em&gt;Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics&lt;/em&gt;, 1679–1684. Florence, Italy, July 2019. Association for Computational Linguistics.
URL: &lt;a href="https://aclanthology.org/P19-1164"&gt;https://aclanthology.org/P19-1164&lt;/a&gt;, &lt;a href="https://doi.org/10.18653/v1/P19-1164"&gt;doi:10.18653/v1/P19-1164&lt;/a&gt;.' href='#stanovsky-etal-2019-evaluating' id='ref-stanovsky-etal-2019-evaluating-1'&gt;(Stanovsky et al., 2019)&lt;/a&gt; specifically looked into occupation nouns like ‘doctor’ and their translation in terms of gender.
Current machine translation&amp;nbsp;(MT) systems have an overgeneralization bias and tend to resort to whatever has been the majority gender in the training data, often ignoring the gender information in the input sentence:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Illustration of the WinoMT challenge using DeepL" src="https://vamvas.ch/assets/thesis/winomt-challenge.png" width="650px"&gt;
&lt;em&gt;An English–German example for WinoMT. I have annotated the gender of ‘doctor’ and its translations with emoji. Note that this is not a shortcoming of DeepL in particular – other MT systems tend to perform similarly badly.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;But the idea behind WinoMT poses a challenge to evaluation as well: How can we account for the fact that there are many good translations of these sentences, and still reliably judge whether the occupation noun has been correctly translated in terms of gender?
My thesis &lt;a href="https://vamvas.ch/the-limits-of-minimal-sentence-pairs"&gt;discusses different methods for targeted evaluation and presents novel experiments that highlight limitations in previous methods&lt;/a&gt;&amp;nbsp;&lt;a class='citation-link' data-toggle='tooltip' data-html='true' title='Jannis Vamvas and Rico Sennrich.
On the limits of minimal pairs in contrastive evaluation.
In &lt;em&gt;Proceedings of the Fourth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP&lt;/em&gt;, 58–68. Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics.
URL: &lt;a href="https://aclanthology.org/2021.blackboxnlp-1.5"&gt;https://aclanthology.org/2021.blackboxnlp-1.5&lt;/a&gt;, &lt;a href="https://doi.org/10.18653/v1/2021.blackboxnlp-1.5"&gt;doi:10.18653/v1/2021.blackboxnlp-1.5&lt;/a&gt;.' href='#vamvas-sennrich-2021-limits' id='ref-vamvas-sennrich-2021-limits-1'&gt;(Vamvas and Sennrich, 2021)&lt;/a&gt;.
We then propose a new model-based targeted evaluation method called &lt;a href="https://vamvas.ch/evaluating-black-box-mt-with-contrastive-conditioning"&gt;&lt;em&gt;Contrastive
Conditioning&lt;/em&gt;&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The idea behind Contrastive Conditioning is to classify the machine translation using another MT system.
We can delegate the evaluation process to that system if we provide it with extra information via an augmented source sequence:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Illustration of Contrastive Conditioning on the example of WinoMT" src="https://vamvas.ch/assets/thesis/contrastive-conditioning.png" width="600px"&gt;&lt;/p&gt;
&lt;p&gt;Model-based evaluation allows us to scale the evaluation across multiple target languages, without having to create custom test sets for every target language.
Another advantage of our method is that anyone can automatically analyze the translations from black-box systems like DeepL.
Our method does not require access to the system that has generated the
machine translations.&lt;/p&gt;
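&lt;p&gt;As a rough sketch of the idea (illustrative only; the exact augmentation and scoring are described in the papers): score the given machine translation under two contrastive, disambiguated source sentences with an off-the-shelf MT model, and take the higher-scoring variant as the verdict. The evaluator model and the sentences below are assumptions made for the sake of the example.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Rough sketch of contrastive conditioning (illustrative only).
import torch
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-en-de"  # assumed off-the-shelf evaluator MT model
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name).eval()

def log_likelihood(source, target):
    """Total log-probability of `target` given `source` under the evaluator model."""
    inputs = tokenizer([source], return_tensors="pt")
    labels = tokenizer(text_target=[target], return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(**inputs, labels=labels).loss  # mean negative log-likelihood per token
    return -loss.item() * labels.shape[1]

# Translation produced by the (black-box) system under test:
translation = "Der Arzt bat die Krankenschwester um Hilfe."
# Contrastive source variants that disambiguate the doctor's gender:
contrastive_sources = {
    "female": "The female doctor asked the nurse for help.",
    "male": "The male doctor asked the nurse for help.",
}
scores = {k: log_likelihood(src, translation) for k, src in contrastive_sources.items()}
print(max(scores, key=scores.get))  # gender that the translation most plausibly realizes
&lt;/code&gt;&lt;/pre&gt;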
&lt;p&gt;In the thesis, we demonstrate how the method can be applied to the problems of
(1) &lt;a href="https://vamvas.ch/when-mt-distillation-leads-to-bias"&gt;measuring lexical overgeneralization bias in MT&amp;nbsp;(like WinoMT) and
showing that distilled translation models overgeneralize more strongly&lt;/a&gt;&amp;nbsp;&lt;a class='citation-link' data-toggle='tooltip' data-html='true' title='Jannis Vamvas and Rico Sennrich.
Contrastive conditioning for assessing disambiguation in &lt;span class="bibtex-protected"&gt;MT&lt;/span&gt;: &lt;span class="bibtex-protected"&gt;A&lt;/span&gt; case study of distilled bias.
In &lt;em&gt;Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing&lt;/em&gt;, 10246–10265. Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics.
URL: &lt;a href="https://aclanthology.org/2021.emnlp-main.803"&gt;https://aclanthology.org/2021.emnlp-main.803&lt;/a&gt;, &lt;a href="https://doi.org/10.18653/v1/2021.emnlp-main.803"&gt;doi:10.18653/v1/2021.emnlp-main.803&lt;/a&gt;.' href='#vamvas-sennrich-2021-contrastive' id='ref-vamvas-sennrich-2021-contrastive-1'&gt;(Vamvas and Sennrich, 2021)&lt;/a&gt;,
and (2) &lt;a href="https://vamvas.ch/lost-and-found-in-translation"&gt;detecting coverage errors in MT, e.g., detecting whether information has been lost in translation&lt;/a&gt;&amp;nbsp;&lt;a class='citation-link' data-toggle='tooltip' data-html='true' title='Jannis Vamvas and Rico Sennrich.
As little as possible, as much as necessary: detecting over- and undertranslations with contrastive conditioning.
In &lt;em&gt;Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)&lt;/em&gt;, 490–500. Dublin, Ireland, May 2022. Association for Computational Linguistics.
URL: &lt;a href="https://aclanthology.org/2022.acl-short.53"&gt;https://aclanthology.org/2022.acl-short.53&lt;/a&gt;.' href='#vamvas-sennrich-2022-little' id='ref-vamvas-sennrich-2022-little-1'&gt;(Vamvas and Sennrich, 2022)&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;A recurring idea in the thesis is that MT systems can be useful beyond translation: as a model of multilinguality and of semantic equivalence across languages.
A perfect example is the method described above, since Contrastive Conditioning makes use of MT to automate targeted evaluation.
But in our last paper, we looked at the idea from a different angle and &lt;a href="https://vamvas.ch/nmtscore-text-similarity-via-translation"&gt;demonstrated different ways of how an MT system can be queried for estimating semantic similarity&lt;/a&gt;&amp;nbsp;&lt;a class='citation-link' data-toggle='tooltip' data-html='true' title='Jannis Vamvas and Rico Sennrich.
&lt;span class="bibtex-protected"&gt;NMTS&lt;/span&gt;core: a multilingual analysis of translation-based text similarity measures.
In &lt;em&gt;Findings of the Association for Computational Linguistics: EMNLP 2022&lt;/em&gt;, 198–213. Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics.
URL: &lt;a href="https://aclanthology.org/2022.findings-emnlp.15"&gt;https://aclanthology.org/2022.findings-emnlp.15&lt;/a&gt;.' href='#vamvas-sennrich-2022-nmtscore' id='ref-vamvas-sennrich-2022-nmtscore-1'&gt;(Vamvas and Sennrich, 2022)&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Specifically, MT systems can judge whether two sentences are paraphrases of each other&amp;nbsp;&lt;a class='citation-link' data-toggle='tooltip' data-html='true' title='Brian Thompson and Matt Post.
Automatic machine translation evaluation in many languages via zero-shot paraphrasing.
In &lt;em&gt;Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)&lt;/em&gt;, 90–121. Online, November 2020. Association for Computational Linguistics.
URL: &lt;a href="https://aclanthology.org/2020.emnlp-main.8"&gt;https://aclanthology.org/2020.emnlp-main.8&lt;/a&gt;, &lt;a href="https://doi.org/10.18653/v1/2020.emnlp-main.8"&gt;doi:10.18653/v1/2020.emnlp-main.8&lt;/a&gt;.' href='#thompson-post-2020-automatic' id='ref-thompson-post-2020-automatic-1'&gt;(Thompson and Post, 2020)&lt;/a&gt;.
This is especially interesting if phrases seem to look similar but have an opposite meaning, as in &lt;a href="https://arxiv.org/abs/1904.01130"&gt;&lt;em&gt;“Flights from New York to Florida”&lt;/em&gt; vs. &lt;em&gt;“Flights from Florida to New York”&lt;/em&gt;&lt;/a&gt;.
In that case, we showed that MT-based approaches, such as our proposed &lt;em&gt;translation cross-likelihood&lt;/em&gt; measure, are much more accurate than alternative approaches:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Bar chart of the accuracies of various approaches on multilingual paraphrase identification" src="https://vamvas.ch/assets/thesis/paraphrase-identification-results.png" width="550px"&gt;
&lt;em&gt;Accuracy of different text similarity measures on paraphrase identification (on average across test sets in 9 languages).
  The accuracy of translation-based measures can be increased by applying a normalization (dark red), such as reconstruction normalization.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Overall, I see my thesis as a contribution towards extending the range of technical possibilities in multilingual NLP evaluation.
But there are still many limitations, including fundamental ones.
Seventy years ago, Warren Weaver&amp;nbsp;(1894–1978), an influential technologist at the dawn of the computer age, put forward four principles that he saw as crucial for advancing NLP&amp;nbsp;&lt;a class='citation-link' data-toggle='tooltip' data-html='true' title='Warren Weaver.
Translation.
In &lt;em&gt;Proceedings of the Conference on Mechanical Translation&lt;/em&gt;. Massachusetts Institute of Technology, 17-20 June 1952.
URL: &lt;a href="https://aclanthology.org/1952.earlymt-1.1"&gt;https://aclanthology.org/1952.earlymt-1.1&lt;/a&gt;.' href='#weaver-1952-translation' id='ref-weaver-1952-translation-1'&gt;(Weaver, 1952)&lt;/a&gt;.
In the introduction to my thesis, I revisit his memorandum and find that three of his principles have been realized by now, in some form or other.
The principles envisioned by Weaver – contextualization, machine learning and information theory – are now closely reflected in the state of the art of NLP.&lt;/p&gt;
&lt;p&gt;However, his memorandum concludes with a fourth principle: multilinguality.
And while this idea has inspired research ever since, the terms in which modern NLP systems can be understood as being truly multilingual are still unclear.
Common to the methods presented in this thesis is a functionalist approach – crafting inputs and observing the outputs of NLP systems.
To bring Weaver’s Fourth Principle to fruition and to verify its success, a functionalist approach might not be enough for NLP.&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The work presented in this post was funded by the Swiss National Science Foundation (project MUTAMUR; no. 176727).
I thank the members of the supervisory and doctoral committees for their valuable feedback.&lt;/em&gt;&lt;/p&gt;&lt;div id="bibliography" class="appendix"&gt;&lt;h2&gt;References&lt;/h2&gt;&lt;p id='stanovsky-etal-2019-evaluating'&gt;Gabriel Stanovsky, Noah&amp;nbsp;A. Smith, and Luke Zettlemoyer.
Evaluating gender bias in machine translation.
In &lt;em&gt;Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics&lt;/em&gt;, 1679–1684. Florence, Italy, July 2019. Association for Computational Linguistics.
URL: &lt;a href="https://aclanthology.org/P19-1164"&gt;https://aclanthology.org/P19-1164&lt;/a&gt;, &lt;a href="https://doi.org/10.18653/v1/P19-1164"&gt;doi:10.18653/v1/P19-1164&lt;/a&gt;. &lt;a class="cite-backref" href="#ref-stanovsky-etal-2019-evaluating-1" title="Jump back to reference 1"&gt;↩&lt;/a&gt;&lt;/p&gt;
&lt;p id='thompson-post-2020-automatic'&gt;Brian Thompson and Matt Post.
Automatic machine translation evaluation in many languages via zero-shot paraphrasing.
In &lt;em&gt;Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)&lt;/em&gt;, 90–121. Online, November 2020. Association for Computational Linguistics.
URL: &lt;a href="https://aclanthology.org/2020.emnlp-main.8"&gt;https://aclanthology.org/2020.emnlp-main.8&lt;/a&gt;, &lt;a href="https://doi.org/10.18653/v1/2020.emnlp-main.8"&gt;doi:10.18653/v1/2020.emnlp-main.8&lt;/a&gt;. &lt;a class="cite-backref" href="#ref-thompson-post-2020-automatic-1" title="Jump back to reference 1"&gt;↩&lt;/a&gt;&lt;/p&gt;
&lt;p id='vamvas-sennrich-2021-contrastive'&gt;Jannis Vamvas and Rico Sennrich.
Contrastive conditioning for assessing disambiguation in &lt;span class="bibtex-protected"&gt;MT&lt;/span&gt;: &lt;span class="bibtex-protected"&gt;A&lt;/span&gt; case study of distilled bias.
In &lt;em&gt;Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing&lt;/em&gt;, 10246–10265. Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics.
URL: &lt;a href="https://aclanthology.org/2021.emnlp-main.803"&gt;https://aclanthology.org/2021.emnlp-main.803&lt;/a&gt;, &lt;a href="https://doi.org/10.18653/v1/2021.emnlp-main.803"&gt;doi:10.18653/v1/2021.emnlp-main.803&lt;/a&gt;. &lt;a class="cite-backref" href="#ref-vamvas-sennrich-2021-contrastive-1" title="Jump back to reference 1"&gt;↩&lt;/a&gt;&lt;/p&gt;
&lt;p id='vamvas-sennrich-2021-limits'&gt;Jannis Vamvas and Rico Sennrich.
On the limits of minimal pairs in contrastive evaluation.
In &lt;em&gt;Proceedings of the Fourth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP&lt;/em&gt;, 58–68. Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics.
URL: &lt;a href="https://aclanthology.org/2021.blackboxnlp-1.5"&gt;https://aclanthology.org/2021.blackboxnlp-1.5&lt;/a&gt;, &lt;a href="https://doi.org/10.18653/v1/2021.blackboxnlp-1.5"&gt;doi:10.18653/v1/2021.blackboxnlp-1.5&lt;/a&gt;. &lt;a class="cite-backref" href="#ref-vamvas-sennrich-2021-limits-1" title="Jump back to reference 1"&gt;↩&lt;/a&gt;&lt;/p&gt;
&lt;p id='vamvas-sennrich-2022-little'&gt;Jannis Vamvas and Rico Sennrich.
As little as possible, as much as necessary: detecting over- and undertranslations with contrastive conditioning.
In &lt;em&gt;Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)&lt;/em&gt;, 490–500. Dublin, Ireland, May 2022. Association for Computational Linguistics.
URL: &lt;a href="https://aclanthology.org/2022.acl-short.53"&gt;https://aclanthology.org/2022.acl-short.53&lt;/a&gt;. &lt;a class="cite-backref" href="#ref-vamvas-sennrich-2022-little-1" title="Jump back to reference 1"&gt;↩&lt;/a&gt;&lt;/p&gt;
&lt;p id='vamvas-sennrich-2022-nmtscore'&gt;Jannis Vamvas and Rico Sennrich.
&lt;span class="bibtex-protected"&gt;NMTS&lt;/span&gt;core: a multilingual analysis of translation-based text similarity measures.
In &lt;em&gt;Findings of the Association for Computational Linguistics: EMNLP 2022&lt;/em&gt;, 198–213. Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics.
URL: &lt;a href="https://aclanthology.org/2022.findings-emnlp.15"&gt;https://aclanthology.org/2022.findings-emnlp.15&lt;/a&gt;. &lt;a class="cite-backref" href="#ref-vamvas-sennrich-2022-nmtscore-1" title="Jump back to reference 1"&gt;↩&lt;/a&gt;&lt;/p&gt;
&lt;p id='weaver-1952-translation'&gt;Warren Weaver.
Translation.
In &lt;em&gt;Proceedings of the Conference on Mechanical Translation&lt;/em&gt;. Massachusetts Institute of Technology, 17-20 June 1952.
URL: &lt;a href="https://aclanthology.org/1952.earlymt-1.1"&gt;https://aclanthology.org/1952.earlymt-1.1&lt;/a&gt;. &lt;a class="cite-backref" href="#ref-weaver-1952-translation-1" title="Jump back to reference 1"&gt;↩&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;</content><category term="blog"></category></entry><entry><title>Introducing SwissBERT</title><link href="https://vamvas.ch/introducing-swissbert" rel="alternate"></link><published>2023-03-24T00:00:00+01:00</published><updated>2023-03-24T00:00:00+01:00</updated><author><name>Jannis Vamvas</name></author><id>tag:vamvas.ch,2023-03-24:/introducing-swissbert</id><summary type="html">&lt;p&gt;The multilingual language model for Switzerland.&lt;/p&gt;</summary><content type="html">&lt;p&gt;Self-supervised text encoders such as BERT are an important tool for natural language processing (NLP) applications.
You won’t see them chatting away on the web like some large-scale generative models.
But they are useful for NLP practitioners because they can be trained with mere billions, rather than trillions, of words, and they lend themselves to supervised fine-tuning.&lt;/p&gt;
&lt;p&gt;After the &lt;a href="https://github.com/google-research/bert"&gt;original BERT model&lt;/a&gt; was released for English, many others were created, like &lt;a href="https://camembert-model.fr/"&gt;CamemBERT&lt;/a&gt; for French and &lt;a href="https://github.com/idb-ita/GilBERTo"&gt;GilBERTo&lt;/a&gt; for Italian.
Now, a team at the University of Zurich is adding another one to the list: We release &lt;a href="https://huggingface.co/ZurichNLP/swissbert"&gt;SwissBERT&lt;/a&gt;, the multilingual language model for Switzerland.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Schematic illustration of SwissBERT and its language adapters" src="https://vamvas.ch/assets/swissbert/swissbert-diagram.png" width="450px"&gt;&lt;/p&gt;
&lt;p&gt;SwissBERT supports Swiss Standard German, French, Italian and Romansh Grischun. It might even be extended to Swiss German dialects in the future.&lt;/p&gt;
&lt;p&gt;Why wasn’t there a unified model for the Swiss national languages until now?
One reason is that Switzerland is a multilingual country, and multilinguality is still a challenge in NLP.
While there is plenty of training data available on the web for German, French, and Italian, there is little data for Romansh.
Making sure that the higher-resource languages do not drown out the other languages in the model is non-trivial.&lt;/p&gt;
&lt;p&gt;Another consideration is that there are already open-source models for three out of the four languages.
Ideally, one could somehow combine these existing resources, adapt them to the peculiarities of Switzerland and add Romansh to the mix.
Figuring out how to apply such a “Swiss Finish” has been another challenge.&lt;/p&gt;
&lt;p&gt;To tackle these challenges, we used an approach from the recent literature: language-specific model components, or simply &lt;em&gt;language adapters&lt;/em&gt;.
An advantage of language adapters is that each language has a reserved module of equal capacity.
Each language adapter is activated only if the model processes input in the given language.&lt;/p&gt;
&lt;p&gt;Specifically, we based SwissBERT on a massively multilingual model, &lt;a href="https://huggingface.co/facebook/xmod-base"&gt;X-MOD&lt;/a&gt;, which has been pre-trained with language adapters from scratch by &lt;a class='citation-link' data-toggle='tooltip' data-html='true' title='Jonas Pfeiffer, Naman Goyal, Xi&amp;nbsp;Lin, Xian Li, James Cross, Sebastian Riedel, and Mikel Artetxe.
Lifting the curse of multilinguality by pre-training modular transformers.
In &lt;em&gt;Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies&lt;/em&gt;, 3479–3495. Seattle, United States, July 2022. Association for Computational Linguistics.
URL: &lt;a href="https://aclanthology.org/2022.naacl-main.255"&gt;https://aclanthology.org/2022.naacl-main.255&lt;/a&gt;, &lt;a href="https://doi.org/10.18653/v1/2022.naacl-main.255"&gt;doi:10.18653/v1/2022.naacl-main.255&lt;/a&gt;.' href='#pfeiffer-etal-2022-lifting' id='ref-pfeiffer-etal-2022-lifting-1'&gt;Pfeiffer et al. (2022)&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img alt="We train two variants of SwissBERT: Variant 1 reuses the vocabulary and embeddings of the pre-trained model, and
only language adapters are trained. Variant 2 uses a custom SwissBERT vocabulary based on our pre-training corpus, and
multilingual embeddings are trained in addition to the adapters." src="https://vamvas.ch/assets/swissbert/swissbert-xmod-adaptation.png"&gt;&lt;/p&gt;
&lt;p&gt;We trained the existing German, French and Italian adapters as well as a new Romansh adapter on 21 million news articles from Switzerland.
Testing out two design variants, we found that a custom vocabulary and custom-trained word embeddings (Variant 2 on the right) are better suited for the Swiss national languages than those of the massively multilingual X-MOD.&lt;/p&gt;
&lt;p&gt;The 21 million news articles have been retrieved from &lt;a href="https://t.uzh.ch/1hI"&gt;Swissdox&amp;#64;LiRI&lt;/a&gt;, which provides access to many newspapers in the Swiss Media Database (SMD).
Thus, rather than crawling the web, we had the chance to use a high-quality and clearly defined corpus for pre-training.&lt;/p&gt;
&lt;p&gt;After ten passes through the pre-training corpus, we evaluated SwissBERT on a range of NLP tasks related to Switzerland.
We first wanted to see how well it handles the sort of text it has been pre-trained on: contemporary news from Switzerland. We also looked into slightly different text domains to gauge the general performance of the model.&lt;/p&gt;
&lt;p&gt;To evaluate our model on named entity recognition (NER), we created &lt;a href="https://huggingface.co/datasets/ZurichNLP/swissner"&gt;a collection of small-scale test sets for the four national languages&lt;/a&gt;.
Below is an example for the expected output of NER:
&lt;img alt="Example for the SwissNER dataset" src="https://vamvas.ch/assets/swissbert/swissner-example.png" width="500px"&gt;&lt;/p&gt;
&lt;p&gt;We were pleased to see that SwissBERT clearly outperforms the baselines on our test sets, both in terms of supervised NER (German, French and Italian) and zero-shot cross-lingual transfer (Romansh):&lt;/p&gt;
&lt;p&gt;&lt;img alt="Results on NER" src="https://vamvas.ch/assets/swissbert/swissner-results.png" width="550px"&gt;&lt;/p&gt;
&lt;p&gt;Another nice result is that SwissBERT performs unsupervised alignment of Romansh text to German text &lt;a class='citation-link' data-toggle='tooltip' data-html='true' title='Eyal Dolev.
Using multilingual word embeddings for similarity-based word alignments in a zero-shot setting, tested on the case of &lt;span class="bibtex-protected"&gt;G&lt;/span&gt;erman–&lt;span class="bibtex-protected"&gt;R&lt;/span&gt;omansh.
Master&amp;#39;s thesis, Department of Computational Linguistics, University of Zurich, 2022.' href='#dolev2022thesis' id='ref-dolev2022thesis-1'&gt;(Dolev, 2022)&lt;/a&gt; much more accurately than previous models, which have not been trained on Romansh:
&lt;img alt="Results on German–Romansh alignment" src="https://vamvas.ch/assets/swissbert/alignment-results.png" width="450px"&gt;
In other words, SwissBERT is good at comparing Romansh sentences to German sentences and identifying similar and dissimilar words and phrases.
The same probably holds for other language combinations.&lt;/p&gt;
&lt;p&gt;We also evaluated on &lt;a href="https://vamvas.ch/more-general-stance-detection-with-x-stance"&gt;cross-lingual classification of political comments&lt;/a&gt; (which works well) as well as &lt;a href="https://hipe-eval.github.io/HIPE-2022/"&gt;NER for historical newspapers&lt;/a&gt; (which does not work too well with SwissBERT).
The complete results are documented in our &lt;a href="https://arxiv.org/abs/2303.13310"&gt;paper pre-print&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Given the encouraging results, we hope that our model can support researchers who need to analyze large amounts of contemporary written text in one of the Swiss national languages.
Due to the nature of the pre-training corpus, we &lt;a href="https://huggingface.co/ZurichNLP/swissbert"&gt;release the model with the CC BY-NC 4.0 license&lt;/a&gt; for now.
This means that it can immediately be used by academic researchers, but not (yet?) for commercial applications.&lt;/p&gt;
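&lt;p&gt;For researchers who would like to try it out, the sketch below shows how the model and its language adapters can be used with the Hugging Face transformers library. It is meant as a rough illustration; a recent transformers version with X-MOD support is assumed, and the adapter codes (such as de_CH) are the ones listed on the model card.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;code&gt;import torch
from transformers import AutoModel, AutoTokenizer

# Load SwissBERT (an X-MOD model with Swiss-specific adapters and vocabulary)
tokenizer = AutoTokenizer.from_pretrained("ZurichNLP/swissbert")
model = AutoModel.from_pretrained("ZurichNLP/swissbert")

# Activate the adapter of the input language before encoding text
# (assumed adapter code; see the model card for the full list)
model.set_default_language("de_CH")

sentence = "Wir veröffentlichen SwissBERT, ein Sprachmodell für die Schweiz."
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool the token representations into a single sentence representation
sentence_embedding = outputs.last_hidden_state.mean(dim=1)
print(sentence_embedding.shape)  # e.g. torch.Size([1, 768])
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;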
&lt;p&gt;&lt;br&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The work presented in this post is joint work with &lt;a href="https://www.cl.uzh.ch/de/people/alumni/graen.html"&gt;Johannes Graën&lt;/a&gt; of UZH's Linguistic Research Infrastructure (LiRI) and my PhD supervisor &lt;a href="https://www.cl.uzh.ch/de/people/team/compling/sennrich.html"&gt;Rico Sennrich&lt;/a&gt;. It was funded by the Swiss National Science Foundation (project MUTAMUR; no. 176727).&lt;/em&gt;&lt;/p&gt;&lt;div id="bibliography" class="appendix"&gt;&lt;h2&gt;References&lt;/h2&gt;&lt;p id='dolev2022thesis'&gt;Eyal Dolev.
Using multilingual word embeddings for similarity-based word alignments in a zero-shot setting, tested on the case of &lt;span class="bibtex-protected"&gt;G&lt;/span&gt;erman–&lt;span class="bibtex-protected"&gt;R&lt;/span&gt;omansh.
Master&amp;#39;s thesis, Department of Computational Linguistics, University of Zurich, 2022. &lt;a class="cite-backref" href="#ref-dolev2022thesis-1" title="Jump back to reference 1"&gt;↩&lt;/a&gt;&lt;/p&gt;
&lt;p id='pfeiffer-etal-2022-lifting'&gt;Jonas Pfeiffer, Naman Goyal, Xi&amp;nbsp;Lin, Xian Li, James Cross, Sebastian Riedel, and Mikel Artetxe.
Lifting the curse of multilinguality by pre-training modular transformers.
In &lt;em&gt;Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies&lt;/em&gt;, 3479–3495. Seattle, United States, July 2022. Association for Computational Linguistics.
URL: &lt;a href="https://aclanthology.org/2022.naacl-main.255"&gt;https://aclanthology.org/2022.naacl-main.255&lt;/a&gt;, &lt;a href="https://doi.org/10.18653/v1/2022.naacl-main.255"&gt;doi:10.18653/v1/2022.naacl-main.255&lt;/a&gt;. &lt;a class="cite-backref" href="#ref-pfeiffer-etal-2022-lifting-1" title="Jump back to reference 1"&gt;↩&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;</content><category term="blog"></category></entry><entry><title>Translation Puzzles are In‑context Learning Tasks</title><link href="https://vamvas.ch/translation-puzzles-are-in-context-learning-tasks" rel="alternate"></link><published>2022-12-05T00:00:00+01:00</published><updated>2022-12-05T00:00:00+01:00</updated><author><name>Jannis Vamvas</name></author><id>tag:vamvas.ch,2022-12-05:/translation-puzzles-are-in-context-learning-tasks</id><summary type="html">&lt;p&gt;Large language models can tackle some hard linguistic tasks.&lt;/p&gt;</summary><content type="html">&lt;p&gt;A research preview of OpenAI's &lt;a href="https://openai.com/blog/chatgpt/"&gt;ChatGPT&lt;/a&gt; has received a lot of attention.
The positive public reaction seems well-deserved, given that the system is not only a state-of-the-art language model, but has also been carefully fine-tuned based on human feedback.
As a result, ChatGPT's answers seem a bit more useful than the output of previous language models, even though the system has clear &lt;a href="https://openai.com/blog/chatgpt/#limitations"&gt;limitations&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;When I tested ChatGPT myself last week, one of the things I tried was difficult translation puzzles.
Here is an example of such a puzzle, taken from a paper by &lt;a class='citation-link' data-toggle='tooltip' data-html='true' title='G&lt;span class="bibtex-protected"&gt;ö&lt;/span&gt;zde&amp;nbsp;G&lt;span class="bibtex-protected"&gt;ü&lt;/span&gt;l &lt;span class="bibtex-protected"&gt;&lt;span class="bibtex-protected"&gt;Ş&lt;/span&gt;&lt;/span&gt;ahin, Yova Kementchedjhieva, Phillip Rust, and Iryna Gurevych.
&lt;span class="bibtex-protected"&gt;P&lt;/span&gt;uzz&lt;span class="bibtex-protected"&gt;L&lt;/span&gt;ing &lt;span class="bibtex-protected"&gt;M&lt;/span&gt;achines: &lt;span class="bibtex-protected"&gt;A&lt;/span&gt; &lt;span class="bibtex-protected"&gt;C&lt;/span&gt;hallenge on &lt;span class="bibtex-protected"&gt;L&lt;/span&gt;earning &lt;span class="bibtex-protected"&gt;F&lt;/span&gt;rom &lt;span class="bibtex-protected"&gt;S&lt;/span&gt;mall &lt;span class="bibtex-protected"&gt;D&lt;/span&gt;ata.
In &lt;em&gt;Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics&lt;/em&gt;, 1241–1254. Online, July 2020. Association for Computational Linguistics.
URL: &lt;a href="https://aclanthology.org/2020.acl-main.115"&gt;https://aclanthology.org/2020.acl-main.115&lt;/a&gt;, &lt;a href="https://doi.org/10.18653/v1/2020.acl-main.115"&gt;doi:10.18653/v1/2020.acl-main.115&lt;/a&gt;.' href='#sahin-etal-2020-puzzling' id='ref-sahin-etal-2020-puzzling-1'&gt;Şahin et al. (2020)&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Screenshot of the &amp;quot;Chickasaw&amp;quot; translation puzzle." src="https://vamvas.ch/assets/chatgpt-puzzling/chickasaw.png" width="500px"&gt;&lt;/p&gt;
&lt;p&gt;The puzzle was originally created by Tom Payne for a Linguistics Olympiad, where students from around the world compete on linguistic tasks.
Translation puzzles are very challenging for common natural language processing algorithms.
&lt;a class='citation-link' data-toggle='tooltip' data-html='true' title='G&lt;span class="bibtex-protected"&gt;ö&lt;/span&gt;zde&amp;nbsp;G&lt;span class="bibtex-protected"&gt;ü&lt;/span&gt;l &lt;span class="bibtex-protected"&gt;&lt;span class="bibtex-protected"&gt;Ş&lt;/span&gt;&lt;/span&gt;ahin, Yova Kementchedjhieva, Phillip Rust, and Iryna Gurevych.
&lt;span class="bibtex-protected"&gt;P&lt;/span&gt;uzz&lt;span class="bibtex-protected"&gt;L&lt;/span&gt;ing &lt;span class="bibtex-protected"&gt;M&lt;/span&gt;achines: &lt;span class="bibtex-protected"&gt;A&lt;/span&gt; &lt;span class="bibtex-protected"&gt;C&lt;/span&gt;hallenge on &lt;span class="bibtex-protected"&gt;L&lt;/span&gt;earning &lt;span class="bibtex-protected"&gt;F&lt;/span&gt;rom &lt;span class="bibtex-protected"&gt;S&lt;/span&gt;mall &lt;span class="bibtex-protected"&gt;D&lt;/span&gt;ata.
In &lt;em&gt;Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics&lt;/em&gt;, 1241–1254. Online, July 2020. Association for Computational Linguistics.
URL: &lt;a href="https://aclanthology.org/2020.acl-main.115"&gt;https://aclanthology.org/2020.acl-main.115&lt;/a&gt;, &lt;a href="https://doi.org/10.18653/v1/2020.acl-main.115"&gt;doi:10.18653/v1/2020.acl-main.115&lt;/a&gt;.' href='#sahin-etal-2020-puzzling' id='ref-sahin-etal-2020-puzzling-2'&gt;Şahin et al. (2020)&lt;/a&gt; have demonstrated this in their &lt;a href="https://ukplab.github.io/PuzzLing-Machines/"&gt;PuzzLing Machines benchmark&lt;/a&gt;, where the highest accuracy reached by any algorithm has been 3.2%.&lt;/p&gt;
&lt;p&gt;The reason why the algorithms cannot solve the puzzles is that they are not really designed for this task.
Machine learning does not favor tasks where there are very few examples but each example requires intensive analysis.
Neural networks in particular are usually trained on tens of millions of example sentences, rather than just seven sentences.
A friend of mine, Antonio Bikić, has compared this phenomenon to the &lt;a href="https://en.wikipedia.org/wiki/Dutch_disease"&gt;"Dutch Disease"&lt;/a&gt; in economics:
The abundance of data in natural language processing has led researchers to neglect methods that could use data in an intensive, rather than extensive, manner.&lt;/p&gt;
&lt;p&gt;Now let's see how ChatGPT handles the puzzle. The top part is my question and the bottom part is ChatGPT's reply:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Screenshot of ChatGPT solving the puzzle." src="https://vamvas.ch/assets/chatgpt-puzzling/chickasaw_chat.png" width="500px"&gt;&lt;/p&gt;
&lt;p&gt;ChatGPT has translated the first two sentences correctly.
The third translation is not quite correct; according to the reference translation it should be "Holloli" instead of "Hollo."
Nevertheless, this looks like a great start.&lt;/p&gt;
&lt;h2&gt;Testing on the full benchmark&lt;/h2&gt;
&lt;p&gt;ChatGPT might have seen this particular puzzle during training since it is included as an example in the benchmark paper.
So I asked ChatGPT to also solve the 142 puzzles from the PuzzLing Machines test set, for which there are no solutions on the web.
Some of these require ChatGPT to translate from English into another language, and some require it to translate into English.
Here I report the average score for the two directions.&lt;/p&gt;
&lt;p&gt;In terms of ChrF, a metric that measures character overlap with the reference translation, the results are as follows:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Bar chart of the results in terms of ChrF." src="https://vamvas.ch/assets/chatgpt-puzzling/chrf.png" width="500px"&gt;&lt;/p&gt;
&lt;p&gt;While baselines such as Phrase-based Statistical MT achieve up to 40.2%, ChatGPT reaches 65.9%.&lt;/p&gt;
&lt;p&gt;The other metric I've looked at is the ratio of exact matches. This metric is lower than ChrF because partially correct translations do not receive any credit here.
As a result, previous baselines have achieved little more than zero accuracy:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Bar chart of the results in terms of exact-match accuracy." src="https://vamvas.ch/assets/chatgpt-puzzling/em.png" width="500px"&gt;&lt;/p&gt;
&lt;p&gt;ChatGPT reaches 23.9% exact-match accuracy. Most of its answers are still wrong, but it performs much better than the previous baselines.&lt;/p&gt;
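&lt;p&gt;Both metrics are straightforward to compute. The snippet below is a minimal sketch using the sacrebleu library (a recent version is assumed); the hypotheses and references are invented placeholders, not data from the benchmark.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;code&gt;from sacrebleu.metrics import CHRF

chrf = CHRF()

# Invented placeholder outputs and reference translations
hypotheses = ["the dog sees the cat .", "the cat sleeps ."]
references = ["the dog sees the cat .", "the cat is sleeping ."]

# Corpus-level ChrF (character n-gram F-score)
print(chrf.corpus_score(hypotheses, [references]).score)

# Exact-match accuracy: fraction of outputs that equal the reference verbatim
exact_match = sum(h == r for h, r in zip(hypotheses, references)) / len(references)
print(exact_match)  # 0.5
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;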
&lt;h2&gt;Reasons for the high accuracy&lt;/h2&gt;
&lt;p&gt;Why does ChatGPT do so much better when solving translation puzzles? It seems to me that the way ChatGPT works is a very good fit for these translation puzzles.&lt;/p&gt;
&lt;p&gt;First of all, ChatGPT avoids repetition.
Repetition is a notorious problem in text generation, and so the developers have probably put in some guardrails against this behavior.
Avoidance of repetition is also a property of translation puzzles.
Each test sentence in a translation puzzle somehow reuses the vocabulary from the examples in a previously unseen way.
As long as ChatGPT tries to do something with the input while not repeating it verbatim, it will likely get some answers right.&lt;/p&gt;
&lt;p&gt;Another consideration is that ChatGPT has probably seen many examples of the non-English languages during training.
Most of the languages in the translation puzzles have few speakers, such as &lt;a href="https://en.wikipedia.org/wiki/Chickasaw_language"&gt;Chickasaw&lt;/a&gt; from the initial example, which has 75 speakers according to Wikipedia.
But a few puzzles also involve languages with many speakers, such as Polish or Greek.
This might allow ChatGPT to translate some test sentences without even looking at the example sentences.&lt;/p&gt;
&lt;p&gt;However, the most important advantage of ChatGPT is probably an idea called &lt;em&gt;in-context learning&lt;/em&gt;.
The previous approaches have divided the puzzle into two phases:
First, a statistical model is trained on the example sentences.
Then, that model makes a prediction for each test sentence.&lt;/p&gt;
&lt;p&gt;In contrast, ChatGPT can process the puzzle as a whole.
All the example sentences and all the test sentences are part of the context provided to the model.
When it generates an answer, the language model can attend to all the relevant parts of the context simultaneously.&lt;/p&gt;
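&lt;p&gt;To make the difference concrete, here is a rough sketch of how a whole puzzle can be packed into a single prompt. The helper function and the example sentences are invented for illustration and do not come from the benchmark.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;code&gt;def build_puzzle_prompt(example_pairs, test_sentences, language="the puzzle language"):
    """Pack a whole translation puzzle into one prompt for in-context learning."""
    lines = [f"Here are some sentences in {language} with their English translations:", ""]
    for source, translation in example_pairs:
        lines.append(f"{source} = {translation}")
    lines.append("")
    lines.append("Translate the following sentences into English:")
    for sentence in test_sentences:
        lines.append(sentence)
    return "\n".join(lines)

# Invented placeholder data, just to illustrate the prompt format
examples = [("Taro kipu.", "The dog sleeps."), ("Nalu kipu.", "The cat sleeps.")]
tests = ["Taro miki."]
print(build_puzzle_prompt(examples, tests))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;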
&lt;p&gt;It is a matter of debate whether this attention to the context can be seen as a form of learning.
Irrespective of what it is called, the phenomenon has inspired interesting approaches to dealing with little training data, especially if similar patterns have already been in the training data of the language model &lt;a class='citation-link' data-toggle='tooltip' data-html='true' title='Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared&amp;nbsp;D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei.
Language models are few-shot learners.
In H.&amp;nbsp;Larochelle, M.&amp;nbsp;Ranzato, R.&amp;nbsp;Hadsell, M.F. Balcan, and H.&amp;nbsp;Lin, editors, &lt;em&gt;Advances in Neural Information Processing Systems&lt;/em&gt;, volume&amp;nbsp;33, 1877–1901. Curran Associates, Inc., 2020.
URL: &lt;a href="https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf"&gt;https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf&lt;/a&gt;.' href='#brown2020gpt' id='ref-brown2020gpt-1'&gt;(Brown et al., 2020)&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Unfortunately, it will be difficult to analyze these individual factors as long as the training details of ChatGPT, as well as the model weights and code, are not public.
I hope that other institutions will be able to replicate ChatGPT in an open form, as has happened with other breakthroughs in the past.&lt;/p&gt;
&lt;h2&gt;Other observations&lt;/h2&gt;
&lt;p&gt;In one case, the system declined to give a solution and replied:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;It is not possible to provide accurate translations for the test sentences without more information about the language they are written in and the context in which they are used. The sentences provided are not in a recognizable language and do not follow any discernible grammar or syntax rules, so it is impossible to determine their meaning or provide translations for them.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I clicked the "Try again" button and used the answers from the second attempt.&lt;/p&gt;
&lt;p&gt;In another instance, ChatGPT offered a solution but added the following note:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Note: These translations are based on the example sentences provided and may not be correct. It is always best to consult a qualified translator or language expert for accurate translations.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;On the one hand, this kind of hedging is typical of ChatGPT, and will certainly increase the safety of the system.
On the other hand, it is not quite clear why it would happily have a go at a puzzle in 99% of cases, and defer to a linguist in the other 1%.&lt;/p&gt;
&lt;p&gt;Finally, an important difference in the way the puzzles are used at a Linguistics Olympiad is that the participants are expected to provide an explanation for their solution.
As &lt;a class='citation-link' data-toggle='tooltip' data-html='true' title='G&lt;span class="bibtex-protected"&gt;ö&lt;/span&gt;zde&amp;nbsp;G&lt;span class="bibtex-protected"&gt;ü&lt;/span&gt;l &lt;span class="bibtex-protected"&gt;&lt;span class="bibtex-protected"&gt;Ş&lt;/span&gt;&lt;/span&gt;ahin, Yova Kementchedjhieva, Phillip Rust, and Iryna Gurevych.
&lt;span class="bibtex-protected"&gt;P&lt;/span&gt;uzz&lt;span class="bibtex-protected"&gt;L&lt;/span&gt;ing &lt;span class="bibtex-protected"&gt;M&lt;/span&gt;achines: &lt;span class="bibtex-protected"&gt;A&lt;/span&gt; &lt;span class="bibtex-protected"&gt;C&lt;/span&gt;hallenge on &lt;span class="bibtex-protected"&gt;L&lt;/span&gt;earning &lt;span class="bibtex-protected"&gt;F&lt;/span&gt;rom &lt;span class="bibtex-protected"&gt;S&lt;/span&gt;mall &lt;span class="bibtex-protected"&gt;D&lt;/span&gt;ata.
In &lt;em&gt;Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics&lt;/em&gt;, 1241–1254. Online, July 2020. Association for Computational Linguistics.
URL: &lt;a href="https://aclanthology.org/2020.acl-main.115"&gt;https://aclanthology.org/2020.acl-main.115&lt;/a&gt;, &lt;a href="https://doi.org/10.18653/v1/2020.acl-main.115"&gt;doi:10.18653/v1/2020.acl-main.115&lt;/a&gt;.' href='#sahin-etal-2020-puzzling' id='ref-sahin-etal-2020-puzzling-3'&gt;Şahin et al. (2020)&lt;/a&gt; mention, the students will still get some points if their approach is valid.
When I asked ChatGPT to explain its solution, it told me it was "happy to explain" and wrote five paragraphs with a lot of linguistic jargon.
Needless to say that the explanation was all wrong.&lt;/p&gt;&lt;div id="bibliography" class="appendix"&gt;&lt;h2&gt;References&lt;/h2&gt;&lt;p id='brown2020gpt'&gt;Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared&amp;nbsp;D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei.
Language models are few-shot learners.
In H.&amp;nbsp;Larochelle, M.&amp;nbsp;Ranzato, R.&amp;nbsp;Hadsell, M.F. Balcan, and H.&amp;nbsp;Lin, editors, &lt;em&gt;Advances in Neural Information Processing Systems&lt;/em&gt;, volume&amp;nbsp;33, 1877–1901. Curran Associates, Inc., 2020.
URL: &lt;a href="https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf"&gt;https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf&lt;/a&gt;. &lt;a class="cite-backref" href="#ref-brown2020gpt-1" title="Jump back to reference 1"&gt;↩&lt;/a&gt;&lt;/p&gt;
&lt;p id='sahin-etal-2020-puzzling'&gt;G&lt;span class="bibtex-protected"&gt;ö&lt;/span&gt;zde&amp;nbsp;G&lt;span class="bibtex-protected"&gt;ü&lt;/span&gt;l &lt;span class="bibtex-protected"&gt;&lt;span class="bibtex-protected"&gt;Ş&lt;/span&gt;&lt;/span&gt;ahin, Yova Kementchedjhieva, Phillip Rust, and Iryna Gurevych.
&lt;span class="bibtex-protected"&gt;P&lt;/span&gt;uzz&lt;span class="bibtex-protected"&gt;L&lt;/span&gt;ing &lt;span class="bibtex-protected"&gt;M&lt;/span&gt;achines: &lt;span class="bibtex-protected"&gt;A&lt;/span&gt; &lt;span class="bibtex-protected"&gt;C&lt;/span&gt;hallenge on &lt;span class="bibtex-protected"&gt;L&lt;/span&gt;earning &lt;span class="bibtex-protected"&gt;F&lt;/span&gt;rom &lt;span class="bibtex-protected"&gt;S&lt;/span&gt;mall &lt;span class="bibtex-protected"&gt;D&lt;/span&gt;ata.
In &lt;em&gt;Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics&lt;/em&gt;, 1241–1254. Online, July 2020. Association for Computational Linguistics.
URL: &lt;a href="https://aclanthology.org/2020.acl-main.115"&gt;https://aclanthology.org/2020.acl-main.115&lt;/a&gt;, &lt;a href="https://doi.org/10.18653/v1/2020.acl-main.115"&gt;doi:10.18653/v1/2020.acl-main.115&lt;/a&gt;. &lt;a class="cite-backref" href="#ref-sahin-etal-2020-puzzling-1" title="Jump back to reference 1"&gt;↩&lt;/a&gt;&lt;a class="cite-backref" href="#ref-sahin-etal-2020-puzzling-1" title="Jump back to reference 1"&gt; &lt;sup&gt;1&lt;/sup&gt; &lt;/a&gt;&lt;a class="cite-backref" href="#ref-sahin-etal-2020-puzzling-2" title="Jump back to reference 2"&gt;&lt;sup&gt;2&lt;/sup&gt; &lt;/a&gt;&lt;a class="cite-backref" href="#ref-sahin-etal-2020-puzzling-3" title="Jump back to reference 3"&gt;&lt;sup&gt;3&lt;/sup&gt; &lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;</content><category term="blog"></category></entry><entry><title>Three Diffusion Digressions</title><link href="https://vamvas.ch/three-diffusion-digressions" rel="alternate"></link><published>2022-09-18T00:00:00+02:00</published><updated>2022-09-18T00:00:00+02:00</updated><author><name>Jannis Vamvas</name></author><id>tag:vamvas.ch,2022-09-18:/three-diffusion-digressions</id><summary type="html">&lt;p&gt;The Stable Diffusion release inspired me to make tiny concept art.&lt;/p&gt;</summary><content type="html">&lt;p&gt;It was a nice coincidence that the Stable Diffusion model was released right before my vacation.
While DALL·E 2 had popularized image generation with diffusion before, few people have access to it.
In contrast, &lt;a href="https://github.com/CompVis/stable-diffusion"&gt;Stable Diffusion&lt;/a&gt; is open source, and hundreds of developers and artists all over the world have started to experiment with it and have been sharing their creations online.
Since I now had a week of free time at hand, I decided to explore the model on my own.
In this blogpost, I document the three mini-inventions that I came up with.&lt;/p&gt;
&lt;h2&gt;Hypermosaics&lt;/h2&gt;
&lt;blockquote class="twitter-tweet" data-dnt="true"&gt;&lt;p lang="en" dir="ltr"&gt;photo of a turtle all the way down &lt;a href="https://twitter.com/hashtag/stablediffusion?src=hash&amp;amp;ref_src=twsrc%5Etfw"&gt;#stablediffusion&lt;/a&gt; &lt;a href="https://t.co/JlHyHTscxq"&gt;pic.twitter.com/JlHyHTscxq&lt;/a&gt;&lt;/p&gt;&amp;mdash; Jannis Vamvas (@j_vamvas) &lt;a href="https://twitter.com/j_vamvas/status/1568970877768765444?ref_src=twsrc%5Etfw"&gt;September 11, 2022&lt;/a&gt;&lt;/blockquote&gt;
&lt;script async src="https://platform.twitter.com/widgets.js" charset="utf-8"&gt;&lt;/script&gt;

&lt;p&gt;I think many people have heard of photomosaics, which &lt;a href="https://en.wikipedia.org/wiki/Photographic_mosaic"&gt;go back to the nineties&lt;/a&gt;. People who have tried creating a photomosaic will also know how difficult it is.
The goal of a photomosaic is to compose a primary image out of a large number of component images.
Because all these images are pre-defined – e.g., they are photos made by human photographers – the process is computationally complex and usually requires a few tricks, such as manipulating the color of the component images.&lt;/p&gt;
&lt;p&gt;In contrast, creating a photomosaic using images generated by Stable Diffusion is straightforward.
Let's start with creating the primary image. In this example, I am using the prompt “photo of a turtle” throughout.
&lt;img alt="&amp;quot;photo of a turtle&amp;quot; generated with Stable Diffusion" src="https://vamvas.ch/assets/diffusion/primary_turtle.jpg" width="300px"&gt;&lt;/p&gt;
&lt;p&gt;Given a primary image, we can then generate the component images top-down. Let's divide the image into tiles (I am using a 64×64 grid).&lt;/p&gt;
&lt;p&gt;&lt;img alt="Schema of mosaic generation using Stable Diffusion&amp;quot;" src="https://vamvas.ch/assets/diffusion/turtle_grid.png" width="400px"&gt;&lt;/p&gt;
&lt;p&gt;We can then upsample each tile to 512×512px and use it as an input for generating another photo of a turtle.
As you can see in the example above, this process (called &lt;em&gt;img2img&lt;/em&gt;) preserves the color gradient in the original tile, which is important for rendering the details of the primary turtle.&lt;/p&gt;
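&lt;p&gt;Here is a rough sketch of this per-tile step using the Hugging Face diffusers library. It is untested in this form; the model name and the parameter names and values are assumptions that may differ between library versions.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;code&gt;import torch
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

prompt = "photo of a turtle"

def generate_component_image(tile):
    """Turn one low-resolution tile (a PIL image) into a new photo of a turtle."""
    # Upsample the tile to the resolution expected by the model
    init_image = tile.resize((512, 512))
    # A moderate strength keeps the tile's overall color while adding new detail
    result = pipe(prompt=prompt, image=init_image, strength=0.8, guidance_scale=7.5)
    return result.images[0]

# Example usage (hypothetical file name):
# from PIL import Image
# primary = Image.open("primary_turtle.jpg")
# tile = primary.crop((0, 0, 8, 8))  # one cell of the 64x64 grid
# component = generate_component_image(tile)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;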
&lt;p&gt;As a result, a mosaic generated with Stable Diffusion can be much more detailed than a traditional photomosaic.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Photomosaic of a turtle created with Stable Diffusion" src="https://vamvas.ch/assets/diffusion/turtle_mosaic.jpg" width="300px"&gt;&lt;/p&gt;
&lt;p&gt;You could argue that a mosaic made out of generated images is somewhat pointless, and I agree.
What makes image generators interesting here is that they enable what I call a &lt;em&gt;hypermosaic&lt;/em&gt;, which would be very difficult to create without them.&lt;/p&gt;
&lt;p&gt;A hypermosaic is an image that is composed of other images, which in turn are composed of other images, and so forth – a hypermosaic has infinite resolution! To stick to the turtle example: A hypermosaic is “photo of a turtle” all the way down.&lt;/p&gt;
&lt;p&gt;With some manual stitching I was able to create a proof of concept within hours based purely on generated images of turtles.
The looped video in the tweet above is the outcome (&lt;a href="https://giphy.com/gifs/vtkeyQYk0FbHuEJkYF"&gt;here is a 10MB GIF&lt;/a&gt;).
Let me know in the comments what you think. My personal opinion: It could make a nice screensaver!&lt;/p&gt;
&lt;h2&gt;Prompt Ensembling&lt;/h2&gt;
&lt;blockquote class="twitter-tweet" data-dnt="true"&gt;&lt;p lang="en" dir="ltr"&gt;Not sure if people have already been doing that, but I find it amazingly easy to ensemble &lt;a href="https://twitter.com/hashtag/stablediffusion?src=hash&amp;amp;ref_src=twsrc%5Etfw"&gt;#stablediffusion&lt;/a&gt; conditioned on two different prompts.&lt;br&gt;&lt;br&gt;Can you guess the two prompts behind these photos? &lt;a href="https://t.co/pkyuwzRPce"&gt;pic.twitter.com/pkyuwzRPce&lt;/a&gt;&lt;/p&gt;&amp;mdash; Jannis Vamvas (@j_vamvas) &lt;a href="https://twitter.com/j_vamvas/status/1569754993640591362?ref_src=twsrc%5Etfw"&gt;September 13, 2022&lt;/a&gt;&lt;/blockquote&gt;
&lt;script async src="https://platform.twitter.com/widgets.js" charset="utf-8"&gt;&lt;/script&gt;

&lt;p&gt;When people explore the capabilities of image generators, they often try to combine concepts that are rarely seen together (chair+avocado or astronaut+horse).
However, Stable Diffusion cannot combine all concepts out of the box. For example, this is what you get for the prompt “photo of a giraffe that looks like a frog”:&lt;/p&gt;
&lt;p&gt;&lt;img alt="&amp;quot;photo of a giraffe that looks like a frog&amp;quot; generated with Stable Diffusion" src="https://vamvas.ch/assets/diffusion/giraffe_looks_like_frog.jpg" width="300px"&gt;&lt;/p&gt;
&lt;p&gt;Not too impressive, right? You get something similar if you put “photo of a mixture of giraffe and frog”:&lt;/p&gt;
&lt;p&gt;&lt;img alt="&amp;quot;photo of a mixture of giraffe and frog&amp;quot; generated with Stable Diffusion" src="https://vamvas.ch/assets/diffusion/mixture_giraffe_frog.jpg" width="300px"&gt;&lt;/p&gt;
&lt;p&gt;Thus, mixing giraffes and frogs calls for a more hands-on approach. 
A concept that is well-known in machine translation is to combine multiple models at inference time (&lt;em&gt;ensembling&lt;/em&gt;).
When working with a single model, one can also combine multiple instances of the same model that are each provided with a different input.
Here, this means combining an instance of the image generator that is conditioned on “photo of a giraffe” with one that is conditioned on “photo of a frog”.&lt;/p&gt;
&lt;p&gt;This approach can also be understood as an interpolation of two conditional models, but to avoid confusion with &lt;a href="https://replicate.com/andreasjansson/stable-diffusion-animation"&gt;&lt;em&gt;prompt interpolation&lt;/em&gt;&lt;/a&gt; I will use the term &lt;em&gt;ensembling&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;Let's look at a nice schema of Stable Diffusion &lt;a href="https://huggingface.co/blog/stable_diffusion"&gt;created by Hugging Face&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Schema of stable diffusion" src="https://vamvas.ch/assets/diffusion/stable_diffusion_hf_schema.png" width="450px"&gt;
&lt;em&gt;&lt;a href="https://github.com/patrickvonplaten/scientific_images/blob/fb2c069965517c09e17b10e1a29da2ce2e3eb599/stable_diffusion.png"&gt;Image source&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;According to the schema, &lt;em&gt;conditioned latents&lt;/em&gt; are iteratively created by conditioning a UNet on the embedded user prompt and a previous latent.
A straightforward way to ensemble two prompts would then be to average the two conditioned latents at each step before feeding them back into the UNets:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Schema of prompt ensembling with latent diffusion" src="https://vamvas.ch/assets/diffusion/prompt_ensembling_schema.png"&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://github.com/jvamvas/diffusers/commit/2cbc13a2c1e18d09b78f2db8a5146b14d88cc73a"&gt;Here is an implementation of this idea that I made using the Hugging Face diffusers library.&lt;/a&gt;
I just needed to add 18 lines.&lt;/p&gt;
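&lt;p&gt;As a hedged illustration of the core step (this is not the actual patch linked above), the ensembling boils down to averaging the two prompt-conditioned latents after each denoising step:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;code&gt;import torch

def ensemble_latents(latents_a, latents_b, weight=0.5):
    """Average the latents produced by two prompt-conditioned denoising branches."""
    # At every diffusion step, each prompt produces its own updated latents;
    # the average is fed back into both branches for the next step.
    return weight * latents_a + (1 - weight) * latents_b

# Dummy example with random tensors of the latent shape used by Stable Diffusion
latents_giraffe = torch.randn(1, 4, 64, 64)
latents_frog = torch.randn(1, 4, 64, 64)
merged = ensemble_latents(latents_giraffe, latents_frog)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;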
&lt;p&gt;If we now ensemble the prompts “photo of a giraffe” and “photo of a frog” (with default settings and for 80 steps), this is what we get:&lt;/p&gt;
&lt;p&gt;&lt;img alt="“photo of a giraffe” ensembled with “photo of a frog”, generated with Stable Diffusion" src="https://vamvas.ch/assets/diffusion/simple_ensemble_result.jpg" width="300px"&gt;&lt;/p&gt;
&lt;p&gt;The ensembling works in the sense that both prompts are represented in the images. However, the model has a hard time combining them in a meaningful or creative way.
In the upper right image, we just see a frog next to a tiny giraffe (which could be seen as a local minimum of what we want to achieve).
So it seems we need to help the model further to converge to a unified object.&lt;/p&gt;
&lt;p&gt;We can do this by ensembling the prompts “photo of a giraffe that looks like a frog” and “photo of a frog that looks like a giraffe” (which we had used individually in the first example).
Now we finally get acceptable images, some of which could even be considered useful and interesting:&lt;/p&gt;
&lt;p&gt;&lt;img alt="“photo of a giraffe that looks like a frog” ensembled with “photo of a frog that looks like a giraffe”, generated with Stable Diffusion" src="https://vamvas.ch/assets/diffusion/final_ensemble_result.jpg" width="300px"&gt;&lt;/p&gt;
&lt;h2&gt;Picture Frame Inpainting&lt;/h2&gt;
&lt;p&gt;To conclude this post, I would like to share my attempt to put Stable Diffusion into real-world use.&lt;/p&gt;
&lt;p&gt;An idea that has achieved quite a lot of attention is inpainting, i.e., completing a region inside the image by using the remainder of the image as context.
For example, there is a popular &lt;a href="https://github.com/AUTOMATIC1111/stable-diffusion-webui"&gt;web UI for Stable Diffusion&lt;/a&gt; that supports inpainting, and there are plugins for various graphics editors.&lt;/p&gt;
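&lt;p&gt;As a minimal sketch of what such an inpainting call looks like with the diffusers library (untested here; the model name and file names are assumptions): a black-and-white mask marks the region to regenerate, and everything outside the mask is kept as context.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;code&gt;import torch
from diffusers import StableDiffusionInpaintPipeline
from PIL import Image

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

# Hypothetical input files: a photo of the room and a mask in which white
# pixels (the inside of the picture frame) are regenerated
room_photo = Image.open("picture_frame_empty.jpg").resize((512, 512))
frame_mask = Image.open("frame_mask.png").resize((512, 512))

result = pipe(prompt="painting of a dog", image=room_photo, mask_image=frame_mask)
result.images[0].save("picture_frame_inpainted.jpg")
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;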
&lt;p&gt;I decided to try out inpainting by inpainting the inside of actual picture frames:&lt;/p&gt;
&lt;blockquote class="twitter-tweet" data-dnt="true"&gt;&lt;p lang="en" dir="ltr"&gt;For people looking to experiment with &lt;a href="https://twitter.com/hashtag/stablediffusion?src=hash&amp;amp;ref_src=twsrc%5Etfw"&gt;#stablediffusion&lt;/a&gt;, I can truly recommend the Krita plugin by &lt;a href="https://twitter.com/NicolayMausz?ref_src=twsrc%5Etfw"&gt;@NicolayMausz&lt;/a&gt; &lt;br&gt;&lt;br&gt;Below: Inpainting picture frames of various styles. Prompt: painting of a dog &lt;a href="https://t.co/9OFQ48hzrN"&gt;pic.twitter.com/9OFQ48hzrN&lt;/a&gt;&lt;/p&gt;&amp;mdash; Jannis Vamvas (@j_vamvas) &lt;a href="https://twitter.com/j_vamvas/status/1567442111212953600?ref_src=twsrc%5Etfw"&gt;September 7, 2022&lt;/a&gt;&lt;/blockquote&gt;
&lt;script async src="https://platform.twitter.com/widgets.js" charset="utf-8"&gt;&lt;/script&gt;

&lt;p&gt;My hope was that the image generator would “attend” to the surroundings of the picture frame when inpainting the image into the frame.
It is difficult to estimate to what degree this actually happens, and my experiments were not always successful.
However, in the (cherry-picked) images above, there are some indications that the image generator considers the context sometimes.
For example, on the upper right the dog's colors match the photo from the IKEA product catalog (&lt;a href="https://www.ikea.com/ch/de/p/lomviken-rahmen-schwarz-30286770/"&gt;original image&lt;/a&gt;).
And on the lower left, the dog's hair might reflect the wooden floor of the apartment.&lt;/p&gt;
&lt;p&gt;As a final digression, I decided to take a step into the physical world and let Stable Diffusion inpaint an image into my own apartment.
There was a picture frame that had long been empty:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Photo of an empty picture frame on a book shelf" src="https://vamvas.ch/assets/diffusion/picture_frame_empty.jpg" width="400px"&gt;&lt;/p&gt;
&lt;p&gt;Using the prompt “drawing of an animal”, I then inpainted the frame virtually.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Photo of an inpainted picture frame on a book shelf" src="https://vamvas.ch/assets/diffusion/picture_frame_inpainted.jpg" width="400px"&gt;&lt;/p&gt;
&lt;p&gt;After a few tries I got an acceptable result.
I was not completely satisfied with the colors, because many of the inpaintings were gray or had a very faint green.
I suspected this was caused by the white bookshelf and the wall, but was not able to get to the root of this phenomenon.&lt;/p&gt;
&lt;p&gt;In order to have an actual art print, I had to upscale the inpainted image.&lt;/p&gt;
&lt;p&gt;I scaled and padded the image to 512×512 and used the &lt;em&gt;img2img&lt;/em&gt; function of Stable Diffusion to create a full-resolution image.
For that I used the prompt “impressionist painting of a small deer hiding inside vegetation, high-resolution art print”.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Enhancing an inpainted image generated with Stable Diffusion using img2img" src="https://vamvas.ch/assets/diffusion/inpaint_img2img.jpg" width="600px"&gt;&lt;/p&gt;
&lt;p&gt;In a second step, I upscaled the image with ESRGAN to get to about 240 DPI.
The print is now standing on my bookshelf and, I hope, will decorate the apartment for many years to come.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A real-life picture frame containing a drawing generated with Stable Diffusion" src="https://vamvas.ch/assets/diffusion/completed_picture_frame.jpg" width="600px"&gt;&lt;/p&gt;</content><category term="blog"></category></entry><entry><title>Lost and Found in Translation</title><link href="https://vamvas.ch/lost-and-found-in-translation" rel="alternate"></link><published>2022-05-25T00:00:00+02:00</published><updated>2022-05-25T00:00:00+02:00</updated><author><name>Jannis Vamvas</name></author><id>tag:vamvas.ch,2022-05-25:/lost-and-found-in-translation</id><summary type="html">&lt;p&gt;Hypothetical reasoning can detect overtranslations and undertranslations.&lt;/p&gt;</summary><content type="html">&lt;p&gt;This blog post is a brief introduction to a paper presented at ACL 2022, titled &lt;a href="https://aclanthology.org/2022.acl-short.53/"&gt;"As Little as Possible, as Much as Necessary:
Detecting Over- and Undertranslations with Contrastive Conditioning"&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;Coverage errors in MT&lt;/h2&gt;
&lt;p&gt;Neural machine translation (NMT) has greatly improved over the last years, but there are still a few typical errors that afflict NMT even in high-resource language pairs.
One common error type is &lt;em&gt;addition&lt;/em&gt; or &lt;em&gt;omission&lt;/em&gt; of content, which is also sometimes called &lt;em&gt;overtranslation&lt;/em&gt; or &lt;em&gt;undertranslation&lt;/em&gt;, or simply an error of &lt;em&gt;incomplete coverage&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;For example, a commercial MT system has recently translated the following English sentence into German (&lt;a href="https://github.com/google/wmt-mqm-human-evaluation"&gt;source&lt;/a&gt;):&lt;/p&gt;
&lt;p&gt;&lt;img alt="Source: &amp;quot;The government, reeling from low oil prices, says it hopes tourism will contribute up to 10 percent of the gross domestic product by 2030, compared to three percent currently.&amp;quot; Translation: &amp;quot;Die Regierung hofft, dass der Tourismus bis 2030 bis zu 10 Prozent des Bruttoinlandsprodukts ausmachen wird, verglichen mit derzeit drei Prozent.&amp;quot;" src="https://vamvas.ch/assets/coverage/online-b-omission-error.png" width="700px"&gt;&lt;/p&gt;
&lt;p&gt;German speakers can confirm that the output sounds fluent.
However, the phrase &lt;em&gt;“reeling from low oil prices”&lt;/em&gt; has not been translated, and so a crucial piece of information is missing in the translation.&lt;/p&gt;
&lt;h2&gt;Contrastive Conditioning&lt;/h2&gt;
&lt;p&gt;In a &lt;a href="https://aclanthology.org/2022.acl-short.53/"&gt;short paper presented at ACL 2022&lt;/a&gt;, we propose a new method for automatically identifying such coverage errors.&lt;/p&gt;
&lt;p&gt;The approach that we use is &lt;em&gt;contrastive conditioning&lt;/em&gt;, and I have written about it in a &lt;a href="/evaluating-black-box-mt-with-contrastive-conditioning"&gt;previous blog post&lt;/a&gt;. We had developed it originally to detect word sense disambiguation errors.&lt;/p&gt;
&lt;p&gt;Our idea was to score a translation conditioned on contrastive source sequences. If the translation looks probable given a particular source sequence, that source sequence can tell us something about the translation:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Figure illustrating the concept of contastive conditioning" src="https://vamvas.ch/assets/coverage/contrastive-conditioning.png" width="600px"&gt;&lt;/p&gt;
&lt;p&gt;In other words, we try to infer properties of a translation by trying different hypothetical source sequences and checking which ones are most plausible for the translation.
This can be done automatically, using an off-the-shelf NMT system.&lt;/p&gt;
&lt;h2&gt;Minimal Example&lt;/h2&gt;
&lt;p&gt;In this paper, we use contrastive conditioning to detect coverage errors.
Let me demonstrate this on the example of a short, made-up translation:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Please exit the plane after landing. =&amp;gt; Bitte verlassen Sie das Flugzeug." src="https://vamvas.ch/assets/coverage/minimal-omission-error.png" width="600px"&gt;&lt;/p&gt;
&lt;p&gt;This translation contains an omission error: The phrase “after landing” seems to have been lost in translation.&lt;/p&gt;
&lt;p&gt;Our idea is that this could be detected by conditioning the translation on another, hypothetical source sequence. Specifically, we expect the translation to be more likely given the hypothetical source sequence “Please exit the plane” than given the actual source.&lt;/p&gt;
&lt;p&gt;So let’s take &lt;a href="https://huggingface.co/facebook/mbart-large-50-one-to-many-mmt"&gt;an open-source NMT model&lt;/a&gt; and verify:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Probability of 'Bitte verlassen Sie das Flugzeug.' given the original source and a partial source" src="https://vamvas.ch/assets/coverage/minimal-example-for-coverage-contrastive-conditioning.png" width="600px"&gt;&lt;/p&gt;
&lt;p&gt;Indeed, the NMT model assigns a higher probability to the hypothetical source sequence that does not contain “after landing”.&lt;/p&gt;
&lt;p&gt;Note that we could use any NMT model for this. It could be the same system that created the translation, but does not have to be.&lt;/p&gt;
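&lt;p&gt;The comparison itself is easy to reproduce with an off-the-shelf model. The snippet below is a rough sketch using the Hugging Face transformers library (a recent version is assumed); it is not the implementation from our paper.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;code&gt;import torch
from transformers import MBart50TokenizerFast, MBartForConditionalGeneration

model_name = "facebook/mbart-large-50-one-to-many-mmt"
tokenizer = MBart50TokenizerFast.from_pretrained(model_name, src_lang="en_XX", tgt_lang="de_DE")
model = MBartForConditionalGeneration.from_pretrained(model_name)
model.eval()

def translation_loss(source, translation):
    """Cross-entropy of the translation given the source (lower = more probable)."""
    inputs = tokenizer(source, text_target=translation, return_tensors="pt")
    with torch.no_grad():
        return model(**inputs).loss.item()

translation = "Bitte verlassen Sie das Flugzeug."

# The target is identical in both calls, so the mean losses are directly comparable
print(translation_loss("Please exit the plane after landing.", translation))
print(translation_loss("Please exit the plane.", translation))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;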
&lt;h2&gt;Searching for Errors&lt;/h2&gt;
&lt;p&gt;After this proof of concept, let's see whether we can spot other omission errors with this method – beyond this made-up example.&lt;/p&gt;
&lt;p&gt;When analyzing a translation, we perform an exhaustive search over all the phrases that might be missing in the translation, and we compare the translation probability conditioned on a partial source sequence (with the phrase removed) to the probability conditioned on the original source sequence:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Probabilities of 'Bitte verlassen Sie das Flugzeug.' given a list of all partial sources" src="https://vamvas.ch/assets/coverage/omission-error-search.png" width="600px"&gt;&lt;/p&gt;
&lt;p&gt;I have grayed out partial sources that can be skipped, since these parts of speech are unlikely to give useful results.&lt;/p&gt;
&lt;p&gt;(For example, the article “the” is not a content word, and it is unlikely that there is a coverage error involving just an article.
After all, we are mainly interested in so-called constituents, which are sometimes defined as word spans that can be removed from a sentence without rendering it ungrammatical. We approximate the concept of constituents by creating a dependency tree and only selecting nodes that meet certain conditions.)&lt;/p&gt;
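&lt;p&gt;As a rough illustration of this filtering step (these are not the exact conditions used in the paper), candidate phrases can be enumerated from a dependency parse, for example with spaCy and its small English model:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;code&gt;import spacy

nlp = spacy.load("en_core_web_sm")

def candidate_phrases(sentence):
    """Enumerate word spans that roughly behave like removable constituents."""
    doc = nlp(sentence)
    for token in doc:
        # Skip the sentence root and pure function words; their subtrees are
        # unlikely to correspond to coverage errors on their own.
        if token.dep_ == "ROOT" or token.pos_ in {"DET", "PUNCT", "AUX"}:
            continue
        subtree = list(token.subtree)
        yield doc[subtree[0].i : subtree[-1].i + 1].text

print(list(candidate_phrases("Please exit the plane after landing.")))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;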
&lt;p&gt;Checking for addition errors can be done in an analogous way:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Reconstruction probabilities of 'Please exit the plane after landing.' given a list of all partial translations" src="https://vamvas.ch/assets/coverage/addition-error-search.png" width="600px"&gt;&lt;/p&gt;
&lt;p&gt;We estimate the reverse probabilities using an NMT model that translates in the reverse direction. In the above example, no partial translation yields a higher likelihood than the full translation, since the translation does not contain an addition error.&lt;/p&gt;
&lt;h2&gt;Real-world Example&lt;/h2&gt;
&lt;p&gt;Finally, let's apply the algorithm to the real-world example I mentioned at the beginning of this post, where “reeling from low oil prices” was missing in a lengthy sentence.
I'm going to use &lt;a href="https://github.com/ZurichNLP/coverage-contrastive-conditioning"&gt;our Python implementation&lt;/a&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;coverage.evaluator&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;CoverageEvaluator&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;translation_models&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_forward_and_backward_model&lt;/span&gt;

&lt;span class="n"&gt;forward_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;backward_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;load_forward_and_backward_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;mbart50&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;src_lang&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;en&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tgt_lang&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;de&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;evaluator&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;CoverageEvaluator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;src_lang&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;en&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;tgt_lang&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;de&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;forward_evaluator&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;forward_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;backward_evaluator&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;backward_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;src&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;The government, reeling from low oil prices, says it hopes tourism will contribute up to 10 percent of the gross domestic product by 2030, compared to three percent currently.&amp;quot;&lt;/span&gt;
&lt;span class="n"&gt;translation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;Die Regierung hofft, dass der Tourismus bis 2030 bis zu 10 Prozent des Bruttoinlandsprodukts ausmachen wird, verglichen mit derzeit drei Prozent.&amp;quot;&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;evaluator&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;detect_errors&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;translation&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Omission errors: reeling from low oil prices | from low oil prices | low | oil&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Looking at the output, it seems that the missing phrase is correctly identified.&lt;/p&gt;
&lt;h2&gt;Evaluation Results&lt;/h2&gt;
&lt;p&gt;In the paper, we describe how we evaluated our approach on a dataset of real-world machine translation errors created by &lt;a class='citation-link' data-toggle='tooltip' data-html='true' title='Markus Freitag, George Foster, David Grangier, Viresh Ratnakar, Qijun Tan, and Wolfgang Macherey.
&lt;span class="bibtex-protected"&gt;Experts, Errors, and Context: A Large-Scale Study of Human Evaluation for Machine Translation&lt;/span&gt;.
&lt;em&gt;Transactions of the Association for Computational Linguistics&lt;/em&gt;, 9:1460&amp;ndash;1474, 12 2021.
URL: &lt;a href="https://doi.org/10.1162/tacl\_a\_00437"&gt;https://doi.org/10.1162/tacl\_a\_00437&lt;/a&gt;, &lt;a href="https://arxiv.org/abs/https://direct.mit.edu/tacl/article-pdf/doi/10.1162/tacl\_a\_00437/1979261/tacl\_a\_00437.pdf"&gt;arXiv:https://direct.mit.edu/tacl/article-pdf/doi/10.1162/tacl\_a\_00437/1979261/tacl\_a\_00437.pdf&lt;/a&gt;.' href='#freitag2021experts' id='ref-freitag2021experts-1'&gt;Freitag et al. (2021)&lt;/a&gt;.
We also perform a human evaluation of word-level precision, in order to better understand what our algorithm gets right and when it fails.
The language pairs we evaluate on are English–German and Chinese–English.&lt;/p&gt;
&lt;p&gt;As a supervised baseline we use a token classification system (based on &lt;a href="http://dx.doi.org/10.18653/v1/2020.acl-main.747"&gt;XLM-Roberta&lt;/a&gt;) that outputs whether a source token is omitted in the translation, and whether a target token is an addition error.
This approach is based on previous work on token-level quality estimation and was implemented with &lt;a href="https://github.com/Unbabel/OpenKiwi"&gt;OpenKiwi&lt;/a&gt; &lt;a class='citation-link' data-toggle='tooltip' data-html='true' title='Fabio Kepler, Jonay Tr&lt;span class="bibtex-protected"&gt;é&lt;/span&gt;nous, Marcos Treviso, Miguel Vera, and Andr&lt;span class="bibtex-protected"&gt;é&lt;/span&gt; F.&amp;nbsp;T. Martins.
&lt;span class="bibtex-protected"&gt;O&lt;/span&gt;pen&lt;span class="bibtex-protected"&gt;K&lt;/span&gt;iwi: an open source framework for quality estimation.
In &lt;em&gt;Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations&lt;/em&gt;, 117–122. Florence, Italy, July 2019. Association for Computational Linguistics.
URL: &lt;a href="https://aclanthology.org/P19-3020"&gt;https://aclanthology.org/P19-3020&lt;/a&gt;, &lt;a href="https://doi.org/10.18653/v1/P19-3020"&gt;doi:10.18653/v1/P19-3020&lt;/a&gt;.' href='#kepler-etal-2019-openkiwi' id='ref-kepler-etal-2019-openkiwi-1'&gt;(Kepler et al., 2019)&lt;/a&gt;.
We trained the supervised baseline on a large-scale dataset of synthetic coverage errors.
In the paper we describe in more detail how we created this dataset, and we release it alongside the &lt;a href="https://github.com/ZurichNLP/coverage-contrastive-conditioning"&gt;code on GitHub&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;On the segment level, we find that our algorithm achieves similar or higher accuracy than the supervised baselines. It is especially accurate for omission errors:
&lt;img alt="Segment-level evaluation results" src="https://vamvas.ch/assets/coverage/segment-level-evaluation.png" width="800px"&gt;&lt;/p&gt;
&lt;p&gt;Regarding addition errors, the accuracy of both methods is likely too low to be helpful. However, there are fewer positive examples of addition errors in the dataset, which makes it difficult to achieve high accuracy.&lt;/p&gt;
&lt;p&gt;The human evaluation shows that the word-level precision is comparable to the segment-level accuracy.
It also turns out that a portion of the detected word spans are actually different types of errors:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Word-level human evaluation results" src="https://vamvas.ch/assets/coverage/word-level-evaluation.png" width="800px"&gt;&lt;/p&gt;
&lt;p&gt;The light blue area in the figure represents detected word spans that are indeed translation errors, but of a different type, for example an accuracy error rather than a coverage error.
There is also a large number of false positives, where our human annotators did not find anything wrong with the highlighted word spans; this is especially common for addition errors.
In our paper, we show examples of these phenomena.&lt;/p&gt;
&lt;h2&gt;Summary&lt;/h2&gt;
&lt;p&gt;We have demonstrated a reference-free method to automatically detect coverage errors in translations. Specifically, our method relies on hypothetical reasoning using contrastive conditioning.&lt;/p&gt;
&lt;p&gt;An advantage of our approach is that it does not require a specifically trained model, such as a quality estimation model. Instead we use an off-the-shelf NMT model, which could even be the model that produced the translation in the first place.&lt;/p&gt;
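&lt;p&gt;To make this more concrete, here is a minimal sketch of how omission detection with contrastive conditioning could look in code. The helpers &lt;em&gt;score_translation&lt;/em&gt; and &lt;em&gt;delete_span&lt;/em&gt; are hypothetical stand-ins for the NMT model's probability score (e.g. the average token log-likelihood) and for removing a candidate span from the source; see the GitHub repository for the actual implementation.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;def detect_omissions(source, translation, candidate_spans,
                     score_translation, delete_span, margin=0.0):
    """Hypothetical sketch: flag source spans whose removal makes the
    existing translation *more* probable under the NMT model."""
    full_score = score_translation(source, translation)
    omissions = []
    for span in candidate_spans:
        reduced_source = delete_span(source, span)
        # If the translation fits the reduced source better, the span
        # was probably not covered by the translation.
        if score_translation(reduced_source, translation) &gt; full_score + margin:
            omissions.append(span)
    return omissions
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
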
&lt;p&gt;Given the encouraging accuracy on omission errors, it would be interesting to see user studies on whether their automatic detection could aid translators and post-editors.
On the other hand, the detection of addition errors seems to be more challenging and remains an open problem.&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The work presented in this post is joint work with my PhD supervisor &lt;a href="https://www.cl.uzh.ch/de/people/team/compling/sennrich.html"&gt;Rico Sennrich&lt;/a&gt;. It was funded by the Swiss National Science Foundation (project MUTAMUR; no. 176727).&lt;/em&gt;&lt;/p&gt;&lt;div id="bibliography" class="appendix"&gt;&lt;h2&gt;References&lt;/h2&gt;&lt;p id='freitag2021experts'&gt;Markus Freitag, George Foster, David Grangier, Viresh Ratnakar, Qijun Tan, and Wolfgang Macherey.
&lt;span class="bibtex-protected"&gt;Experts, Errors, and Context: A Large-Scale Study of Human Evaluation for Machine Translation&lt;/span&gt;.
&lt;em&gt;Transactions of the Association for Computational Linguistics&lt;/em&gt;, 9:1460&amp;ndash;1474, December 2021.
URL: &lt;a href="https://doi.org/10.1162/tacl_a_00437"&gt;https://doi.org/10.1162/tacl_a_00437&lt;/a&gt;. &lt;a class="cite-backref" href="#ref-freitag2021experts-1" title="Jump back to reference 1"&gt;↩&lt;/a&gt;&lt;/p&gt;
&lt;p id='kepler-etal-2019-openkiwi'&gt;Fabio Kepler, Jonay Tr&lt;span class="bibtex-protected"&gt;é&lt;/span&gt;nous, Marcos Treviso, Miguel Vera, and Andr&lt;span class="bibtex-protected"&gt;é&lt;/span&gt; F.&amp;nbsp;T. Martins.
&lt;span class="bibtex-protected"&gt;O&lt;/span&gt;pen&lt;span class="bibtex-protected"&gt;K&lt;/span&gt;iwi: an open source framework for quality estimation.
In &lt;em&gt;Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations&lt;/em&gt;, 117–122. Florence, Italy, July 2019. Association for Computational Linguistics.
URL: &lt;a href="https://aclanthology.org/P19-3020"&gt;https://aclanthology.org/P19-3020&lt;/a&gt;, &lt;a href="https://doi.org/10.18653/v1/P19-3020"&gt;doi:10.18653/v1/P19-3020&lt;/a&gt;. &lt;a class="cite-backref" href="#ref-kepler-etal-2019-openkiwi-1" title="Jump back to reference 1"&gt;↩&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;</content><category term="blog"></category></entry><entry><title>NMTScore: Text Similarity via Translation</title><link href="https://vamvas.ch/nmtscore-text-similarity-via-translation" rel="alternate"></link><published>2022-04-29T00:00:00+02:00</published><updated>2022-04-29T00:00:00+02:00</updated><author><name>Jannis Vamvas</name></author><id>tag:vamvas.ch,2022-04-29:/nmtscore-text-similarity-via-translation</id><summary type="html">&lt;p&gt;Multilingual translation models offer surprising ways of comparing two sentences.&lt;/p&gt;</summary><content type="html">&lt;p&gt;While neural machine translation&amp;nbsp;(NMT) is mainly being used for &lt;em&gt;translating&lt;/em&gt; text, it is also useful for &lt;em&gt;comparing&lt;/em&gt; text. 
This bonus feature of NMT is promising for some areas where attention to detail matters.
We have released &lt;a href="https://github.com/ZurichNLP/nmtscore"&gt;a new Python library&lt;/a&gt; as well as &lt;a href="https://aclanthology.org/2022.findings-emnlp.15/"&gt;a paper accepted to Findings of EMNLP 2022&lt;/a&gt; that compares translation-based similarity measures to baselines such as sentence embeddings.&lt;/p&gt;
&lt;p&gt;In this post I summarize our findings.&lt;/p&gt;
&lt;h2&gt;The concept of translation probability&lt;/h2&gt;
&lt;p&gt;At the core of an NMT system, there is a translation model that estimates the probability of any translation, given the source sequence. For example, a good translation model will tell us that "Bonjour" can probably be translated into English as "Hello", but that "Goodbye" would be an improbable translation.&lt;/p&gt;
&lt;p&gt;Let me demonstrate this using our Python library, &lt;a href="https://github.com/ZurichNLP/nmtscore"&gt;NMTScore&lt;/a&gt;. First, let's download an open-source NMT model from HuggingFace:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;nmtscore.models&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_translation_model&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;load_translation_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;m2m100_418M&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This model, called &lt;a href="https://huggingface.co/facebook/m2m100_418M"&gt;M2M100&lt;/a&gt;, has been released by &lt;a class='citation-link' data-toggle='tooltip' data-html='true' title='Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, Naman Goyal, Tom Birch, Vitaliy Liptchinsky, Sergey Edunov, Michael Auli, and Armand Joulin.
Beyond english-centric multilingual machine translation.
&lt;em&gt;Journal of Machine Learning Research&lt;/em&gt;, 22(107):1&amp;ndash;48, 2021.
URL: &lt;a href="http://jmlr.org/papers/v22/20-1307.html"&gt;http://jmlr.org/papers/v22/20-1307.html&lt;/a&gt;.' href='#fan2021beyond' id='ref-fan2021beyond-1'&gt;Fan et al. (2021)&lt;/a&gt;.
Let's ask the model to estimate some translation probabilities for us:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;en&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Bonjour !&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Hello to you!&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="c1"&gt;# [0.35]&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;en&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Bonjour !&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Sleep well!&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="c1"&gt;# [0.11]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The first argument tells the model about the target language. Of course, if the target language were German instead of English, the translation would become less probable:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;de&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Bonjour !&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Hello to you!&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="c1"&gt;# [0.04]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h2&gt;Multilingual translation models&lt;/h2&gt;
&lt;p&gt;This brings us to the concept of a multilingual translation model.
Multilingual models are trained jointly on multiple source languages and/or target languages.
For example, M2M100 is a many-to-many model that translates between no fewer than 100&amp;nbsp;languages.&lt;/p&gt;
&lt;p&gt;As shown above, a multilingual model only needs to know the target language, but it can infer the source language by itself.
For example, we can create a translation from Swedish without explicitly telling the model that the input is Swedish:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;translate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;en&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Hej Hanna, hur är läget?&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="c1"&gt;# [&amp;#39;Hi Hanna, how is it?&amp;#39;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This is an interesting property, and a side effect is that the input language is allowed to be identical to the target language:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;translate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;en&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Hi Hanna, how are you?&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="c1"&gt;# [&amp;#39;Hi Hanna, how are you?&amp;#39;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The above is sometimes called &lt;em&gt;zero-shot paraphrasing&lt;/em&gt; &lt;a class='citation-link' data-toggle='tooltip' data-html='true' title='Brian Thompson and Matt Post.
Automatic machine translation evaluation in many languages via zero-shot paraphrasing.
In &lt;em&gt;Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)&lt;/em&gt;, 90–121. Online, November 2020. Association for Computational Linguistics.
URL: &lt;a href="https://aclanthology.org/2020.emnlp-main.8"&gt;https://aclanthology.org/2020.emnlp-main.8&lt;/a&gt;, &lt;a href="https://doi.org/10.18653/v1/2020.emnlp-main.8"&gt;doi:10.18653/v1/2020.emnlp-main.8&lt;/a&gt;.' href='#thompson-post-2020-automatic' id='ref-thompson-post-2020-automatic-1'&gt;(Thompson and Post, 2020)&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;Different ways to compare two sentences&lt;/h2&gt;
&lt;p&gt;The basic principle behind NMTScore is that translation probabilities can be used to compare two sentences. For example, "Hello", "Good&amp;nbsp;day", "Bonjour", and "Hej" all have similar meaning. On the other hand, "Hello", "Sleep well" and "Schadenfreude" are not similar with respect to their meaning.&lt;/p&gt;
&lt;p&gt;In the past, researchers have already come up with creative ways of leveraging NMT to compare a sentence&amp;nbsp;A to a sentence&amp;nbsp;B:&lt;/p&gt;
&lt;p&gt;&lt;img alt="An incomplete matrix of translation-based similarity measures" src="https://vamvas.ch/assets/nmtscore/translation-based-similarity-measures-v0.png" width="600px"&gt;&lt;/p&gt;
&lt;p&gt;For example, the translation probability of&amp;nbsp;A given&amp;nbsp;B can be used directly as a similarity measure (left-hand side; &lt;a class='citation-link' data-toggle='tooltip' data-html='true' title='Marcin Junczys-Dowmunt.
Dual conditional cross-entropy filtering of noisy parallel corpora.
In &lt;em&gt;Proceedings of the Third Conference on Machine Translation: Shared Task Papers&lt;/em&gt;, 888–895. Belgium, Brussels, October 2018. Association for Computational Linguistics.
URL: &lt;a href="https://aclanthology.org/W18-6478"&gt;https://aclanthology.org/W18-6478&lt;/a&gt;, &lt;a href="https://doi.org/10.18653/v1/W18-6478"&gt;doi:10.18653/v1/W18-6478&lt;/a&gt;.' href='#junczys-dowmunt-2018-dual' id='ref-junczys-dowmunt-2018-dual-1'&gt;JunczysDowmunt (2018)&lt;/a&gt;, &lt;a class='citation-link' data-toggle='tooltip' data-html='true' title='Brian Thompson and Matt Post.
Automatic machine translation evaluation in many languages via zero-shot paraphrasing.
In &lt;em&gt;Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)&lt;/em&gt;, 90–121. Online, November 2020. Association for Computational Linguistics.
URL: &lt;a href="https://aclanthology.org/2020.emnlp-main.8"&gt;https://aclanthology.org/2020.emnlp-main.8&lt;/a&gt;, &lt;a href="https://doi.org/10.18653/v1/2020.emnlp-main.8"&gt;doi:10.18653/v1/2020.emnlp-main.8&lt;/a&gt;.' href='#thompson-post-2020-automatic' id='ref-thompson-post-2020-automatic-2'&gt;Thompson and Post (2020)&lt;/a&gt;). Alternatively one can also estimate the probability that&amp;nbsp;A is a translation of&amp;nbsp;B via a pivot language, and use that as a similarity measure (right-hand side; &lt;a class='citation-link' data-toggle='tooltip' data-html='true' title='Jonathan Mallinson, Rico Sennrich, and Mirella Lapata.
Paraphrasing revisited with neural machine translation.
In &lt;em&gt;Proceedings of the 15th Conference of the &lt;span class="bibtex-protected"&gt;E&lt;/span&gt;uropean Chapter of the Association for Computational Linguistics: Volume 1, Long Papers&lt;/em&gt;, 881–893. Valencia, Spain, April 2017. Association for Computational Linguistics.
URL: &lt;a href="https://aclanthology.org/E17-1083"&gt;https://aclanthology.org/E17-1083&lt;/a&gt;.' href='#mallinson-etal-2017-paraphrasing' id='ref-mallinson-etal-2017-paraphrasing-1'&gt;Mallinson et al. (2017)&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;Both approaches are useful if&amp;nbsp;A and&amp;nbsp;B are in two different languages (upper row), and also if&amp;nbsp;A and&amp;nbsp;B are in the same language (bottom row). The reason for that is that multilingual NMT models do not need to know the language of their input.&lt;/p&gt;
&lt;p&gt;When creating the matrix above, we found that an interesting variant has not been tried before, and we added that column to the matrix:&lt;/p&gt;
&lt;p&gt;&lt;img alt="An matrix of translation-based similarity measures that includes translation cross-likelihood" src="https://vamvas.ch/assets/nmtscore/translation-based-similarity-measures.png" width="600px"&gt;&lt;/p&gt;
&lt;p&gt;The idea of &lt;em&gt;translation cross-likelihood&lt;/em&gt; is that both&amp;nbsp;A and&amp;nbsp;B are translated into a target language (e.g.&amp;nbsp;English). Specifically, we ask the model whether a translation of&amp;nbsp;B could also be a good translation of&amp;nbsp;A.&lt;/p&gt;
&lt;p&gt;Again, this approach works both for sentences in the same language and cross-lingually. The translation cross-likelihood measure has some other nice properties; for example, it is somewhat more symmetric than the direct translation probability.&lt;/p&gt;
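&lt;p&gt;Here is the same kind of simplified sketch for translation cross-likelihood, again using only the wrapper shown above (the library's NMTScorer class adds normalization on top):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;a = "Hello"
b = "Good day"

# Translate B into the common target language (English in this example) ...
translation_of_b = model.translate("en", [b])

# ... and ask how probable that translation would be as a translation of A.
model.score("en", [a], translation_of_b)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
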
&lt;h2&gt;Advantages of translation-based text similarity&lt;/h2&gt;
&lt;p&gt;In our &lt;a href="https://aclanthology.org/2022.findings-emnlp.15/"&gt;paper&lt;/a&gt;, we evaluate the different variants of NMTScore in two settings: multilingual paraphrase identification, and multilingual reference-based evaluation of generated text.&lt;/p&gt;
&lt;p&gt;The goal of the former is to find out whether two sentences are paraphrases of each other. A similarity measure has high accuracy if it assigns higher similarity to the paraphrases in a dataset than to the non-paraphrases.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Bar chart of paraphrase identification accuracies of different similarity measures" src="https://vamvas.ch/assets/nmtscore/paraphrase-identification-results.png" width="700px"&gt;&lt;/p&gt;
&lt;p&gt;Overall, we found that NMTScore is competitive with common baselines such as embeddings derived from a pre-trained language model.
We also propose a normalization scheme, called &lt;em&gt;reconstruction normalization&lt;/em&gt;, and show that it contributes to the high accuracy of NMTScore.&lt;/p&gt;
&lt;p&gt;NMTScore is especially good with adversarial examples, where deceptively similar sentence pairs need to be distinguished &lt;a class='citation-link' data-toggle='tooltip' data-html='true' title='Yuan Zhang, Jason Baldridge, and Luheng He.
&lt;span class="bibtex-protected"&gt;PAWS&lt;/span&gt;: paraphrase adversaries from word scrambling.
In &lt;em&gt;Proceedings of the 2019 Conference of the North &lt;span class="bibtex-protected"&gt;A&lt;/span&gt;merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)&lt;/em&gt;, 1298–1308. Minneapolis, Minnesota, June 2019. Association for Computational Linguistics.
URL: &lt;a href="https://aclanthology.org/N19-1131"&gt;https://aclanthology.org/N19-1131&lt;/a&gt;, &lt;a href="https://doi.org/10.18653/v1/N19-1131"&gt;doi:10.18653/v1/N19-1131&lt;/a&gt;.' href='#zhang-etal-2019-paws' id='ref-zhang-etal-2019-paws-1'&gt;(Zhang et al., 2019)&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Adversarial example where NMTScore is more accurate than BERTScore" src="https://vamvas.ch/assets/nmtscore/adversarial-example.png" width="700px"&gt;&lt;/p&gt;
&lt;p&gt;Here's the code to reproduce the figure:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;nmtscore&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;NMTScorer&lt;/span&gt;
&lt;span class="n"&gt;scorer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;NMTScorer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;prism&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;scorer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s2"&gt;&amp;quot;Flights from New York to Florida.&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s2"&gt;&amp;quot;Flights from Florida to New York.&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# 0.67&lt;/span&gt;
&lt;span class="n"&gt;scorer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s2"&gt;&amp;quot;Flights from New York to Florida.&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s2"&gt;&amp;quot;Flights to Florida from New York.&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# 0.71&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;In the second evaluation setting – reference-based evaluation – we find performance that is competitive with the baselines.
This is especially relevant to NLP researchers who evaluate and compare text generation systems such as data-to-text systems.
These researchers often use similarity measures (like BLEU) to compare the system output to a reference output, and NMTScore seems to be a relatively reliable choice for such a measure.&lt;/p&gt;
&lt;h2&gt;Outlook&lt;/h2&gt;
&lt;p&gt;Our library is &lt;a href="https://github.com/ZurichNLP/nmtscore"&gt;available on GitHub&lt;/a&gt;. If you'd like to share a use case or a suggestion, please create an issue to let us know.&lt;/p&gt;
&lt;p&gt;In summary, NMTScore and its variants are an attractive complement to other similarity measures.
Keep in mind, however, that the open-source NMT models we use perform especially well with shorter text segments, and do not support all language pairs equally well.&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The work presented in this post is joint work with my PhD supervisor &lt;a href="https://www.cl.uzh.ch/de/people/team/compling/sennrich.html"&gt;Rico Sennrich&lt;/a&gt;. It was funded by the Swiss National Science Foundation (project MUTAMUR; no. 176727).&lt;/em&gt;&lt;/p&gt;&lt;div id="bibliography" class="appendix"&gt;&lt;h2&gt;References&lt;/h2&gt;&lt;p id='fan2021beyond'&gt;Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, Naman Goyal, Tom Birch, Vitaliy Liptchinsky, Sergey Edunov, Michael Auli, and Armand Joulin.
Beyond english-centric multilingual machine translation.
&lt;em&gt;Journal of Machine Learning Research&lt;/em&gt;, 22(107):1&amp;ndash;48, 2021.
URL: &lt;a href="http://jmlr.org/papers/v22/20-1307.html"&gt;http://jmlr.org/papers/v22/20-1307.html&lt;/a&gt;. &lt;a class="cite-backref" href="#ref-fan2021beyond-1" title="Jump back to reference 1"&gt;↩&lt;/a&gt;&lt;/p&gt;
&lt;p id='junczys-dowmunt-2018-dual'&gt;Marcin Junczys-Dowmunt.
Dual conditional cross-entropy filtering of noisy parallel corpora.
In &lt;em&gt;Proceedings of the Third Conference on Machine Translation: Shared Task Papers&lt;/em&gt;, 888–895. Belgium, Brussels, October 2018. Association for Computational Linguistics.
URL: &lt;a href="https://aclanthology.org/W18-6478"&gt;https://aclanthology.org/W18-6478&lt;/a&gt;, &lt;a href="https://doi.org/10.18653/v1/W18-6478"&gt;doi:10.18653/v1/W18-6478&lt;/a&gt;. &lt;a class="cite-backref" href="#ref-junczys-dowmunt-2018-dual-1" title="Jump back to reference 1"&gt;↩&lt;/a&gt;&lt;/p&gt;
&lt;p id='mallinson-etal-2017-paraphrasing'&gt;Jonathan Mallinson, Rico Sennrich, and Mirella Lapata.
Paraphrasing revisited with neural machine translation.
In &lt;em&gt;Proceedings of the 15th Conference of the &lt;span class="bibtex-protected"&gt;E&lt;/span&gt;uropean Chapter of the Association for Computational Linguistics: Volume 1, Long Papers&lt;/em&gt;, 881–893. Valencia, Spain, April 2017. Association for Computational Linguistics.
URL: &lt;a href="https://aclanthology.org/E17-1083"&gt;https://aclanthology.org/E17-1083&lt;/a&gt;. &lt;a class="cite-backref" href="#ref-mallinson-etal-2017-paraphrasing-1" title="Jump back to reference 1"&gt;↩&lt;/a&gt;&lt;/p&gt;
&lt;p id='thompson-post-2020-automatic'&gt;Brian Thompson and Matt Post.
Automatic machine translation evaluation in many languages via zero-shot paraphrasing.
In &lt;em&gt;Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)&lt;/em&gt;, 90–121. Online, November 2020. Association for Computational Linguistics.
URL: &lt;a href="https://aclanthology.org/2020.emnlp-main.8"&gt;https://aclanthology.org/2020.emnlp-main.8&lt;/a&gt;, &lt;a href="https://doi.org/10.18653/v1/2020.emnlp-main.8"&gt;doi:10.18653/v1/2020.emnlp-main.8&lt;/a&gt;. &lt;a class="cite-backref" href="#ref-thompson-post-2020-automatic-1" title="Jump back to reference 1"&gt;↩&lt;/a&gt;&lt;a class="cite-backref" href="#ref-thompson-post-2020-automatic-1" title="Jump back to reference 1"&gt; &lt;sup&gt;1&lt;/sup&gt; &lt;/a&gt;&lt;a class="cite-backref" href="#ref-thompson-post-2020-automatic-2" title="Jump back to reference 2"&gt;&lt;sup&gt;2&lt;/sup&gt; &lt;/a&gt;&lt;/p&gt;
&lt;p id='zhang-etal-2019-paws'&gt;Yuan Zhang, Jason Baldridge, and Luheng He.
&lt;span class="bibtex-protected"&gt;PAWS&lt;/span&gt;: paraphrase adversaries from word scrambling.
In &lt;em&gt;Proceedings of the 2019 Conference of the North &lt;span class="bibtex-protected"&gt;A&lt;/span&gt;merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)&lt;/em&gt;, 1298–1308. Minneapolis, Minnesota, June 2019. Association for Computational Linguistics.
URL: &lt;a href="https://aclanthology.org/N19-1131"&gt;https://aclanthology.org/N19-1131&lt;/a&gt;, &lt;a href="https://doi.org/10.18653/v1/N19-1131"&gt;doi:10.18653/v1/N19-1131&lt;/a&gt;. &lt;a class="cite-backref" href="#ref-zhang-etal-2019-paws-1" title="Jump back to reference 1"&gt;↩&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;</content><category term="blog"></category></entry><entry><title>The Limits of Minimal Sentence Pairs</title><link href="https://vamvas.ch/the-limits-of-minimal-sentence-pairs" rel="alternate"></link><published>2021-10-12T00:00:00+02:00</published><updated>2021-10-12T00:00:00+02:00</updated><author><name>Jannis Vamvas</name></author><id>tag:vamvas.ch,2021-10-12:/the-limits-of-minimal-sentence-pairs</id><summary type="html">&lt;p&gt;Forced decisions between sentences are not always predictive of generated language.&lt;/p&gt;</summary><content type="html">&lt;p&gt;In a &lt;a href="https://aclanthology.org/2021.blackboxnlp-1.5/"&gt;paper presented at BlackboxNLP 2021&lt;/a&gt;, we highlight a limitation of minimal sentence pairs when it comes to predicting generative behavior, and propose a simple technique for improving their predictiveness. This blog post is a brief introduction.&lt;/p&gt;
&lt;h2&gt;Why minimal sentence pairs are useful&lt;/h2&gt;
&lt;p&gt;Minimal sentence pairs &lt;a class='citation-link' data-toggle='tooltip' data-html='true' title='Tal Linzen, Emmanuel Dupoux, and Yoav Goldberg.
Assessing the ability of &lt;span class="bibtex-protected"&gt;LSTM&lt;/span&gt;s to learn syntax-sensitive dependencies.
&lt;em&gt;Transactions of the Association for Computational Linguistics&lt;/em&gt;, 4:521–535, 2016.
URL: &lt;a href="https://aclanthology.org/Q16-1037"&gt;https://aclanthology.org/Q16-1037&lt;/a&gt;, &lt;a href="https://doi.org/10.1162/tacl_a_00115"&gt;doi:10.1162/tacl_a_00115&lt;/a&gt;.' href='#linzen-etal-2016-assessing' id='ref-linzen-etal-2016-assessing-1'&gt;(Linzen et al., 2016)&lt;/a&gt; are frequently used for the contrastive evaluation of language generation models:&lt;/p&gt;
&lt;p&gt;&lt;img alt="An example of a minimal sentence pair in English" src="https://vamvas.ch/assets/minimal-pairs/minimal-sentence-pair-example.png" width="415px"&gt;&lt;/p&gt;
&lt;p&gt;Sentences A and B differ in just one aspect.
In Sentence A, there is agreement between the subject and the verb, whereas in Sentence B, the number of the verb disagrees with the subject.
If a language model assigns a higher probability score to Sentence A and similar sentences than to Sentence B, this can be seen as a preference for subject-verb agreement.&lt;/p&gt;
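&lt;p&gt;As a rough illustration (not the exact setup used in the work cited here), such a forced decision can be implemented by comparing the scores that a language model assigns to the two variants. The sketch below uses GPT-2 via the Transformers library, and the agreement pair is merely a stand-in for the example in the figure:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def avg_neg_log_likelihood(sentence):
    # Average negative log-likelihood of the tokens; lower = more probable
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    return loss.item()

sentence_a = "The keys to the cabinet are on the table."   # agreement
sentence_b = "The keys to the cabinet is on the table."    # disagreement
# Expected: True, i.e. the grammatical variant is more probable
print(avg_neg_log_likelihood(sentence_a) &lt; avg_neg_log_likelihood(sentence_b))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
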
&lt;p&gt;A great advantage of minimal sentence pairs is that they make it possible to automate the evaluation of language models while isolating a specific linguistic phenomenon.
But a question that has also been raised in previous work by &lt;a class='citation-link' data-toggle='tooltip' data-html='true' title='Benjamin Newman, Kai-Siang Ang, Julia Gong, and John Hewitt.
Refining targeted syntactic evaluation of language models.
In &lt;em&gt;Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies&lt;/em&gt;, 3710–3723. Online, June 2021. Association for Computational Linguistics.
URL: &lt;a href="https://aclanthology.org/2021.naacl-main.290"&gt;https://aclanthology.org/2021.naacl-main.290&lt;/a&gt;, &lt;a href="https://doi.org/10.18653/v1/2021.naacl-main.290"&gt;doi:10.18653/v1/2021.naacl-main.290&lt;/a&gt;.' href='#newman-etal-2021-refining' id='ref-newman-etal-2021-refining-1'&gt;Newman et al. (2021)&lt;/a&gt; is how much minimal pairs tell us about the &lt;em&gt;generative behavior&lt;/em&gt; of a model.&lt;/p&gt;
&lt;p&gt;Contrastive evaluation is based on a forced decision between two predefined sequences.
However, at deployment time, end users are often exposed to the 1-best generated sequence, for example in machine translation or in dialogue.
The sequence that end users are seeing might be completely different from the choices given to the model at evaluation time:&lt;/p&gt;
&lt;p&gt;&lt;img alt="An example of a sentence completion generated at deployment time" src="https://vamvas.ch/assets/minimal-pairs/actual-generated-text-example.png" width="500px"&gt;&lt;/p&gt;
&lt;h2&gt;Contrastive evaluation in MT&lt;/h2&gt;
&lt;p&gt;In our paper, we set out to explore the limits of minimal pairs. In order to do that, we perform some experiments in the context of neural machine translation (NMT).&lt;/p&gt;
&lt;p&gt;When evaluating NMT models, minimal pairs are used on the target side &lt;a class='citation-link' data-toggle='tooltip' data-html='true' title='Rico Sennrich.
How grammatical is character-level neural machine translation? &lt;span class="bibtex-protected"&gt;A&lt;/span&gt;ssessing &lt;span class="bibtex-protected"&gt;MT&lt;/span&gt; quality with contrastive translation pairs.
In &lt;em&gt;Proceedings of the 15th Conference of the &lt;span class="bibtex-protected"&gt;E&lt;/span&gt;uropean Chapter of the Association for Computational Linguistics: Volume 2, Short Papers&lt;/em&gt;, 376–382. Valencia, Spain, April 2017. Association for Computational Linguistics.
URL: &lt;a href="https://aclanthology.org/E17-2060"&gt;https://aclanthology.org/E17-2060&lt;/a&gt;.' href='#sennrich-2017-grammatical' id='ref-sennrich-2017-grammatical-1'&gt;(Sennrich, 2017)&lt;/a&gt;.
Given a source sequence, two contrastive translation variants are presented to the model:&lt;/p&gt;
&lt;p&gt;&lt;img alt="An example of a minimal sentence pair for contrastive evaluation in machine translation" src="https://vamvas.ch/assets/minimal-pairs/machine-translation-contrastive-evaluation-example.png" width="465px"&gt;&lt;/p&gt;
&lt;p&gt;The example above shows two German translation variants for an English source sentence. Again, translation A preserves subject-verb agreement and translation B does not.&lt;/p&gt;
&lt;p&gt;The probability score that is output by an NMT system is usually computed as the average log-likelihood of the target tokens, conditioned on the source sequence.
It can be expected that a good NMT system assigns a higher score to the correct translation variant.&lt;/p&gt;
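&lt;p&gt;In code, evaluating a system on such a contrastive test set boils down to counting how often the correct variant receives the higher score. The &lt;em&gt;score&lt;/em&gt; function in this sketch is a hypothetical stand-in for the scoring just described:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;def contrastive_accuracy(test_set, score):
    """test_set: list of (source, correct_translation, contrastive_translation).
    score(source, translation): hypothetical stand-in for an NMT system's
    average log-likelihood of the target tokens given the source."""
    correct = 0
    for source, good, bad in test_set:
        if score(source, good) &gt; score(source, bad):
            correct += 1
    return correct / len(test_set)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
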
&lt;p&gt;But do such minimal pairs always tell us how an NMT system will behave at deployment time?&lt;/p&gt;
&lt;h2&gt;Testing implausible hypotheses using minimal pairs&lt;/h2&gt;
&lt;p&gt;A straightforward way to explore the limits of such minimal pairs is to test an &lt;em&gt;implausible hypothesis about the generative behavior&lt;/em&gt; of NMT systems.&lt;/p&gt;
&lt;p&gt;In previous work, all the hypotheses that have been tested are more or less plausible, such as the hypothesis that NMT systems observe subject-verb agreement.
However, when we test some implausible hypotheses about generative behavior, we find that results based on minimal pairs still produce some evidence for them.&lt;/p&gt;
&lt;p&gt;Specifically, we look at two phenomena in the German language that are driven by cognitive or social factors.
The first implausible hypothesis is that NMT systems very frequently use &lt;em&gt;vague language&lt;/em&gt; in the form of placeholder nouns. In spoken German, if someone doesn’t remember a noun, they might say ‘&lt;em&gt;Ding&lt;/em&gt;’ instead, which means ‘thing’ or ‘thingy’:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Example of the placeholder noun &amp;quot;Ding&amp;quot; in German" src="https://vamvas.ch/assets/minimal-pairs/vague-language-minimal-pair-example.png" width="600px"&gt;&lt;/p&gt;
&lt;p&gt;The second implausible hypothesis is that NMT systems frequently produce &lt;em&gt;hypercorrections&lt;/em&gt;. We look at hypercorrect genitives in German, where dative, rather than genitive, would be the more acceptable case for a preposition:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Example for a hypercorrect genitive in German" src="https://vamvas.ch/assets/minimal-pairs/hypercorrect-genitive-minimal-pair-example.png" width="600px"&gt;&lt;/p&gt;
&lt;p&gt;But why do we consider these phenomena implausible in NMT output? First of all, they are rarely found in the training data, so they would have to originate from somewhere else.
In human speech, the phenomena are caused by &lt;em&gt;cognitive and social factors&lt;/em&gt;, for example the tendency to forget a word when speaking, or the desire to attain social prestige. These factors do not apply to neural language models, making it implausible that they would produce the phenomena.&lt;/p&gt;
&lt;p&gt;When creating test sets of minimal pairs for these phenomena, we find that our NMT systems do not reach 100 percent accuracy:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Result showing that minimal pairs lead to false positives regarding implausible hypotheses" src="https://vamvas.ch/assets/minimal-pairs/minimal-pairs-false-positive-results.png" width="500px"&gt;&lt;/p&gt;
&lt;p&gt;In other words, there are a certain number of instances where the NMT systems decide in favor of the translation containing vague language, or in favor of the translation containing a hypercorrection.&lt;/p&gt;
&lt;p&gt;Taken at face value, these results would indicate that the NMT systems do generate the phenomena occasionally, and also that &lt;a href="/when-mt-distillation-leads-to-bias"&gt;distilled NMT systems&lt;/a&gt; produce the phenomena more often than non-distilled models.
But if you look at actual machine translations, you will almost never find the phenomena (since they are very implausible).
So in this sense, minimal pairs would lead to &lt;em&gt;false positives&lt;/em&gt; about the generative behavior of NMT systems, making comparisons between systems more difficult.&lt;/p&gt;
&lt;h2&gt;Some minimal pairs are more predictive than others&lt;/h2&gt;
&lt;p&gt;One reason why minimal pairs are not entirely predictive of generative behavior is that they are not among the translations that the NMT system would generate by itself, given the source sequence:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Visualization of the distributional discrepancy of a minimal pair" src="https://vamvas.ch/assets/minimal-pairs/distributional-discrepancy.png" width="500px"&gt;&lt;/p&gt;
&lt;p&gt;The &lt;em&gt;1-best translation&lt;/em&gt; has the highest probability score and is usually approximated using beam search in practice.
In comparison, the contrastive variants of the minimal pair usually have a lower probability score.&lt;/p&gt;
&lt;p&gt;One reason for that may be that they have been constructed by humans, and as such are sampled from a slightly different language distribution than what the system would generate by itself.
This discrepancy might be especially large for distilled NMT systems, since they have never been exposed to human-written text during training.&lt;/p&gt;
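&lt;p&gt;One way to picture the discrepancy of a single minimal pair is to compare its score with the score of the system's own 1-best output. This is just an illustrative sketch; &lt;em&gt;translate_best&lt;/em&gt; and &lt;em&gt;score&lt;/em&gt; are hypothetical stand-ins for beam search and for the system's probability score:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;def discrepancy(source, variant, translate_best, score):
    """How much lower does a human-constructed variant score than the
    translation the system would generate by itself?"""
    best_translation = translate_best(source)   # e.g. approximated via beam search
    return score(source, best_translation) - score(source, variant)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
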
&lt;p&gt;It seems reasonable to assume that a high discrepancy of minimal pairs can hurt their predictiveness.
We checked this by creating a second set of minimal pairs from machine-generated references.&lt;/p&gt;
&lt;p&gt;We translated the same source sequences we used before with a variety of commercial MT systems, and constructed minimal pairs based on the machine translations:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Example for a minimal sentence pair derived from machine translations" src="https://vamvas.ch/assets/minimal-pairs/machine-generated-minimal-pairs-example.png" width="520px"&gt;&lt;/p&gt;
&lt;p&gt;In the above example, the commercial MT system has output a translation with correct subject-verb agreement (A-MT). Like before, the incorrect variant can be created by changing the number of the verb (B-MT).
If you happen to speak German, you will also notice that the machine translation uses slightly simpler expressions than the human translation (A).&lt;/p&gt;
&lt;p&gt;When testing our two implausible hypotheses again, we now find that the test sets derived from machine-generated references produce fewer false positives, showing that these test sets are now more predictive of generative behavior:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Comparison of false positives for human-written and machine-generated minimal pairs" src="https://vamvas.ch/assets/minimal-pairs/distil-lingeval-results.png" width="100%"&gt;&lt;/p&gt;
&lt;h2&gt;The DistilLingEval test suite&lt;/h2&gt;
&lt;p&gt;The success that we had with machine-generated references inspired us to also release similar contrastive test sets for other linguistic phenomena.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://github.com/ZurichNLP/distil-lingeval"&gt;DistilLingEval&lt;/a&gt; is a test suite for 8 linguistic phenomena in English→German translation.
While similar to &lt;a href="https://github.com/rsennrich/lingeval97"&gt;LingEval97&lt;/a&gt;, our test suite has been created based on machine-generated references. As such, we expect it to be more predictive when it comes to generative behavior. Have a look at the &lt;a href="https://github.com/ZurichNLP/distil-lingeval"&gt;GitHub repo&lt;/a&gt; to find out more about the test suite.&lt;/p&gt;
&lt;p&gt;Our hope is that in future work, similar test sets will be created for other tasks, languages and linguistic phenomena.&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The work presented in this post is joint work with my PhD supervisor &lt;a href="https://www.cl.uzh.ch/de/people/team/compling/sennrich.html"&gt;Rico Sennrich&lt;/a&gt;. It was funded by the Swiss National Science Foundation (project MUTAMUR; no. 176727).&lt;/em&gt;&lt;/p&gt;&lt;div id="bibliography" class="appendix"&gt;&lt;h2&gt;References&lt;/h2&gt;&lt;p id='linzen-etal-2016-assessing'&gt;Tal Linzen, Emmanuel Dupoux, and Yoav Goldberg.
Assessing the ability of &lt;span class="bibtex-protected"&gt;LSTM&lt;/span&gt;s to learn syntax-sensitive dependencies.
&lt;em&gt;Transactions of the Association for Computational Linguistics&lt;/em&gt;, 4:521–535, 2016.
URL: &lt;a href="https://aclanthology.org/Q16-1037"&gt;https://aclanthology.org/Q16-1037&lt;/a&gt;, &lt;a href="https://doi.org/10.1162/tacl_a_00115"&gt;doi:10.1162/tacl_a_00115&lt;/a&gt;. &lt;a class="cite-backref" href="#ref-linzen-etal-2016-assessing-1" title="Jump back to reference 1"&gt;↩&lt;/a&gt;&lt;/p&gt;
&lt;p id='newman-etal-2021-refining'&gt;Benjamin Newman, Kai-Siang Ang, Julia Gong, and John Hewitt.
Refining targeted syntactic evaluation of language models.
In &lt;em&gt;Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies&lt;/em&gt;, 3710–3723. Online, June 2021. Association for Computational Linguistics.
URL: &lt;a href="https://aclanthology.org/2021.naacl-main.290"&gt;https://aclanthology.org/2021.naacl-main.290&lt;/a&gt;, &lt;a href="https://doi.org/10.18653/v1/2021.naacl-main.290"&gt;doi:10.18653/v1/2021.naacl-main.290&lt;/a&gt;. &lt;a class="cite-backref" href="#ref-newman-etal-2021-refining-1" title="Jump back to reference 1"&gt;↩&lt;/a&gt;&lt;/p&gt;
&lt;p id='sennrich-2017-grammatical'&gt;Rico Sennrich.
How grammatical is character-level neural machine translation? &lt;span class="bibtex-protected"&gt;A&lt;/span&gt;ssessing &lt;span class="bibtex-protected"&gt;MT&lt;/span&gt; quality with contrastive translation pairs.
In &lt;em&gt;Proceedings of the 15th Conference of the &lt;span class="bibtex-protected"&gt;E&lt;/span&gt;uropean Chapter of the Association for Computational Linguistics: Volume 2, Short Papers&lt;/em&gt;, 376–382. Valencia, Spain, April 2017. Association for Computational Linguistics.
URL: &lt;a href="https://aclanthology.org/E17-2060"&gt;https://aclanthology.org/E17-2060&lt;/a&gt;. &lt;a class="cite-backref" href="#ref-sennrich-2017-grammatical-1" title="Jump back to reference 1"&gt;↩&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;</content><category term="blog"></category></entry><entry><title>When MT Distillation Leads to Bias</title><link href="https://vamvas.ch/when-mt-distillation-leads-to-bias" rel="alternate"></link><published>2021-08-29T00:00:00+02:00</published><updated>2021-08-29T00:00:00+02:00</updated><author><name>Jannis Vamvas</name></author><id>tag:vamvas.ch,2021-08-29:/when-mt-distillation-leads-to-bias</id><summary type="html">&lt;p&gt;Distilled translation models tend to overgeneralize.&lt;/p&gt;</summary><content type="html">&lt;p&gt;In a &lt;a href="https://aclanthology.org/2021.emnlp-main.803/"&gt;paper presented at EMNLP 2021&lt;/a&gt;, we take a closer look at lexical overgeneralization in MT. The first part of the paper introduces a new technique for evaluating disambiguation, called &lt;strong&gt;&lt;em&gt;contrastive conditioning&lt;/em&gt;&lt;/strong&gt; (&lt;a href="/evaluating-black-box-mt-with-contrastive-conditioning"&gt;→ blog post&lt;/a&gt;). Here's an introduction to the second part of our paper: A case study showing that distilled MT models have a stronger overgeneralization bias.&lt;/p&gt;
&lt;h2&gt;The Impact of Disambiguation Errors&lt;/h2&gt;
&lt;p&gt;One of the great challenges of machine translation (MT) is inferring the correct word sense of ambiguous words.
There are different ways in which words can be ambiguous – a well-known example are nouns that can mean multiple things.&lt;/p&gt;
&lt;p&gt;For instance, the English noun &lt;em&gt;starter&lt;/em&gt; can refer to an appetizer or to a motor part.
Since German has different words for the two concepts, an MT system needs to decide between &lt;em&gt;Vorspeise&lt;/em&gt; and &lt;em&gt;Anlasser&lt;/em&gt; when translating &lt;em&gt;starter&lt;/em&gt; into German:
&lt;img alt="An example of a translation error when translating _starter_ into German" src="https://vamvas.ch/assets/distilled-bias/disambiguation-wsd-example.png" width="415px"&gt;&lt;/p&gt;
&lt;p&gt;Context usually helps with disambiguation. A &lt;em&gt;starter&lt;/em&gt; that is made of avocados is probably not a motor part. But MT systems sometimes ignore this context and make disambiguation errors nonetheless.&lt;/p&gt;
&lt;p&gt;Another interesting example are gendered occupation names. &lt;a class='citation-link' data-toggle='tooltip' data-html='true' title='Gabriel Stanovsky, Noah&amp;nbsp;A. Smith, and Luke Zettlemoyer.
Evaluating gender bias in machine translation.
In &lt;em&gt;Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics&lt;/em&gt;, 1679–1684. Florence, Italy, July 2019. Association for Computational Linguistics.
URL: &lt;a href="https://www.aclweb.org/anthology/P19-1164"&gt;https://www.aclweb.org/anthology/P19-1164&lt;/a&gt;, &lt;a href="https://doi.org/10.18653/v1/P19-1164"&gt;doi:10.18653/v1/P19-1164&lt;/a&gt;.' href='#stanovsky-etal-2019-evaluating' id='ref-stanovsky-etal-2019-evaluating-1'&gt;Stanovsky et al. (2019)&lt;/a&gt; have demonstrated how even the best MT systems tend to ignore pronouns when translating occupation names from English. This is a problem because morphologically rich languages such as German have different forms for female and male occupation holders:&lt;/p&gt;
&lt;p&gt;&lt;img alt="An example of a translation error when translating _doctor_ into German" src="https://vamvas.ch/assets/distilled-bias/disambiguation-occupations-example.png" width="450px"&gt;&lt;/p&gt;
&lt;p&gt;What is especially unpleasant about such errors is that they are systematic: MT systems tend to ignore female pronouns more often than male pronouns. This is because the training data contain many more male occupation names, among other factors.
Thus, disambiguation errors in MT are mostly caused by an overgeneralization of the training data.&lt;/p&gt;
&lt;p&gt;Clearly, overgeneralization hurts the adequacy of machine translations. But when it comes to overgeneralization of gender, many people see an ethical problem as well. For example, if you are concerned about the dominance of male forms in human-written German texts (as many feminist linguists are), then you should also be concerned about an even greater dominance of male forms in the output of MT systems.&lt;/p&gt;
&lt;h2&gt;Background: Distillation for MT&lt;/h2&gt;
&lt;p&gt;In our case study we look at a technique called &lt;em&gt;sequence-level knowledge distillation&lt;/em&gt;, since there are reasons to believe that it increases overgeneralization.
This technique, originally proposed by &lt;a class='citation-link' data-toggle='tooltip' data-html='true' title='Yoon Kim and Alexander&amp;nbsp;M. Rush.
Sequence-level knowledge distillation.
In &lt;em&gt;Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing&lt;/em&gt;, 1317–1327. Austin, Texas, November 2016. Association for Computational Linguistics.
URL: &lt;a href="https://www.aclweb.org/anthology/D16-1139"&gt;https://www.aclweb.org/anthology/D16-1139&lt;/a&gt;, &lt;a href="https://doi.org/10.18653/v1/D16-1139"&gt;doi:10.18653/v1/D16-1139&lt;/a&gt;.' href='#kim-rush-2016-sequence' id='ref-kim-rush-2016-sequence-1'&gt;Kim and Rush (2016)&lt;/a&gt;, is often used to reduce the size of an MT model, for example to make it fit on your phone.&lt;/p&gt;
&lt;p&gt;The idea is to use not one but three steps to train the model (see the sketch after the list):&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Train a normal MT model (&lt;em&gt;"teacher"&lt;/em&gt;).&lt;/li&gt;
&lt;li&gt;Re-translate all the training data with the teacher.&lt;/li&gt;
&lt;li&gt;Train a smaller MT model (&lt;em&gt;"student"&lt;/em&gt;) on those data.&lt;/li&gt;
&lt;/ol&gt;
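&lt;p&gt;In pseudocode, the pipeline looks roughly like this. The functions are hypothetical placeholders, not a real training API:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;# Hypothetical sketch of sequence-level knowledge distillation for MT.
# train_mt_model and translate_corpus are placeholders, not a real API.

# 1. Train a normal MT model ("teacher") on the original parallel data.
teacher = train_mt_model(sources, references)

# 2. Re-translate all the training sources with the teacher.
distilled_targets = translate_corpus(teacher, sources)

# 3. Train a smaller MT model ("student") on the distilled data.
student = train_mt_model(sources, distilled_targets, size="small")
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
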
&lt;p&gt;A student model trained like this usually reaches a better BLEU score than if it had been trained on the original data.
This means that there must be a difference between the original and the distilled data that makes training a model easier.
While many researchers have looked for such a difference, this figure derived from a paper by &lt;a class='citation-link' data-toggle='tooltip' data-html='true' title='Chunting Zhou, Jiatao Gu, and Graham Neubig.
Understanding knowledge distillation in non-autoregressive machine translation.
In &lt;em&gt;International Conference on Learning Representations&lt;/em&gt;. 2020.
URL: &lt;a href="https://openreview.net/forum?id=BygFVAEKDH"&gt;https://openreview.net/forum?id=BygFVAEKDH&lt;/a&gt;.' href='#Zhou2020Understanding' id='ref-Zhou2020Understanding-1'&gt;Zhou et al. (2020)&lt;/a&gt; is a good visualization:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Visualization of mode reduction in distilled machine translation" src="https://vamvas.ch/assets/distilled-bias/distillation-mode-reduction.png" width="500px"&gt;&lt;/p&gt;
&lt;p&gt;The distilled translations are more predictable given the source sentences. This means that they follow the structure of the source more closely and contain less human noise.&lt;/p&gt;
&lt;p&gt;Our hypothesis was that the same phenomenon also leads student models to commit more disambiguation errors due to overgeneralization.&lt;/p&gt;
&lt;h2&gt;Distillation leads to Overgeneralization&lt;/h2&gt;
&lt;p&gt;A very simple test method is to count the words in the different strata of the training data during distillation.
For example, to find out how distillation affects translations of &lt;em&gt;doctor&lt;/em&gt;, we had a close look at versions of the &lt;a href="http://www.statmt.org/wmt19/"&gt;WMT19 English–German training data&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Counts of male and female forms in German translations of _doctor_" src="https://vamvas.ch/assets/distilled-bias/distillation-bias-doctor.png" width="425px"&gt;&lt;/p&gt;
&lt;p&gt;In the original training data (created by human translators over decades), &lt;em&gt;doctor&lt;/em&gt; is mostly translated into &lt;span style="
    background: #3155A4;
"&gt;&amp;emsp;&lt;/span&gt; male forms such as &lt;em&gt;Arzt&lt;/em&gt; and rarely into &lt;span style="
    background: #E56849;
"&gt;&amp;emsp;&lt;/span&gt; female forms such as &lt;em&gt;Ärztin&lt;/em&gt;. (The &lt;span style="
    background: #E9EDF0
"&gt;&amp;emsp;&lt;/span&gt; center represents word forms that we could not automatically classify as male or female forms based on grammatical gender.)
The distilled training data produced by the teacher have an even stronger imbalance, and the student, when we fed it the same sources, created translations that overgeneralize even more.&lt;/p&gt;
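&lt;p&gt;The counting itself is straightforward. Here is a simplified sketch; the word lists are only illustrative, and our actual analysis classified the forms more carefully:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;from collections import Counter

MALE_FORMS = {"Arzt", "Ärzte", "Arztes"}        # illustrative, not exhaustive
FEMALE_FORMS = {"Ärztin", "Ärztinnen"}          # illustrative, not exhaustive

def count_doctor_translations(parallel_corpus):
    """parallel_corpus: iterable of (English source, German target) pairs."""
    counts = Counter()
    for source, target in parallel_corpus:
        if "doctor" not in source.lower():
            continue
        tokens = set(target.split())
        if tokens.intersection(MALE_FORMS):
            counts["male"] += 1
        elif tokens.intersection(FEMALE_FORMS):
            counts["female"] += 1
        else:
            counts["unclassified"] += 1
    return counts
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
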
&lt;p&gt;To see for yourself that this phenomenon is not specific to doctors, have a look at this Sankey diagram, which shows the same trend for 24 different occupations:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Counts of male and female forms in German translations of frequent occupation nouns" src="https://vamvas.ch/assets/distilled-bias/distillation-sankey-diagram.png"&gt;&lt;/p&gt;
&lt;h2&gt;Results on Probing Tasks&lt;/h2&gt;
&lt;p&gt;A limitation of counting words is that there is no guarantee the context contains a clue about the correct translation. 
Many English source sentences might use the word &lt;em&gt;doctor&lt;/em&gt; in an inherently ambiguous sense that would be impossible to disambiguate even for human translators.
In that sense, we have so far only observed a weak form of overgeneralization.&lt;/p&gt;
&lt;p&gt;To verify that distilled MT systems also have a strong overgeneralization bias (that they also tend to ignore the more salient contexts), we performed experiments using two probing datasets for word sense disambiguation: &lt;a href="https://github.com/Helsinki-NLP/MuCoW"&gt;MuCoW&lt;/a&gt; &lt;a class='citation-link' data-toggle='tooltip' data-html='true' title='Alessandro Raganato, Yves Scherrer, and J&lt;span class="bibtex-protected"&gt;ö&lt;/span&gt;rg Tiedemann.
The &lt;span class="bibtex-protected"&gt;M&lt;/span&gt;u&lt;span class="bibtex-protected"&gt;C&lt;/span&gt;o&lt;span class="bibtex-protected"&gt;W&lt;/span&gt; test suite at &lt;span class="bibtex-protected"&gt;WMT&lt;/span&gt; 2019: automatically harvested multilingual contrastive word sense disambiguation test sets for machine translation.
In &lt;em&gt;Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1)&lt;/em&gt;, 470–480. Florence, Italy, August 2019. Association for Computational Linguistics.
URL: &lt;a href="https://www.aclweb.org/anthology/W19-5354"&gt;https://www.aclweb.org/anthology/W19-5354&lt;/a&gt;, &lt;a href="https://doi.org/10.18653/v1/W19-5354"&gt;doi:10.18653/v1/W19-5354&lt;/a&gt;.' href='#raganato-etal-2019-mucow' id='ref-raganato-etal-2019-mucow-1'&gt;(Raganato et al., 2019)&lt;/a&gt; and &lt;a href="https://github.com/gabrielStanovsky/mt_gender"&gt;WinoMT&lt;/a&gt; &lt;a class='citation-link' data-toggle='tooltip' data-html='true' title='Gabriel Stanovsky, Noah&amp;nbsp;A. Smith, and Luke Zettlemoyer.
Evaluating gender bias in machine translation.
In &lt;em&gt;Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics&lt;/em&gt;, 1679–1684. Florence, Italy, July 2019. Association for Computational Linguistics.
URL: &lt;a href="https://www.aclweb.org/anthology/P19-1164"&gt;https://www.aclweb.org/anthology/P19-1164&lt;/a&gt;, &lt;a href="https://doi.org/10.18653/v1/P19-1164"&gt;doi:10.18653/v1/P19-1164&lt;/a&gt;.' href='#stanovsky-etal-2019-evaluating' id='ref-stanovsky-etal-2019-evaluating-2'&gt;(Stanovsky et al., 2019)&lt;/a&gt;, and two language pairs (English–German and English–Russian).
On top of the two datasets, we used a novel evaluation protocol called &lt;a href="/evaluating-black-box-mt-with-contrastive-conditioning"&gt;&lt;em&gt;contrastive conditioning&lt;/em&gt;&lt;/a&gt; that allowed us to judge the 1-best translations of our models with a high recall.&lt;/p&gt;
&lt;p&gt;Below I have included one of four graphs from the results (the others look similar and can be found in the &lt;a href="https://aclanthology.org/2021.emnlp-main.803/"&gt;paper&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;&lt;img alt="Graph of main results showing that distilled models have more disambiguation bias when controlled for BLEU" src="https://vamvas.ch/assets/distilled-bias/distilled-bias-wsd-results.png" width="400px"&gt;
&lt;em&gt;Accuracies of various English–German models, &lt;span style="
    background: #B1CB12;
"&gt;&amp;emsp;&lt;/span&gt; distilled and &lt;span style="
    background: #04836E;
"&gt;&amp;emsp;&lt;/span&gt; non-distilled, on word sense disambiguation.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Our probing results confirm that distilled MT systems indeed have a higher overgeneralization bias than comparable non-distilled models, even if we control for BLEU.
Another – expected – implication of the results is that BLEU scores do not capture overgeneralization bias reliably.
It seems that to trace the effects of distillation in MT, targeted evaluation is needed as much as ever, and we hope that contrastive conditioning can contribute to it.&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The work presented in this post is joint work with my PhD supervisor &lt;a href="https://www.cl.uzh.ch/de/people/team/compling/sennrich.html"&gt;Rico Sennrich&lt;/a&gt;. It was funded by the Swiss National Science Foundation (project MUTAMUR; no. 176727).&lt;/em&gt;&lt;/p&gt;&lt;div id="bibliography" class="appendix"&gt;&lt;h2&gt;References&lt;/h2&gt;&lt;p id='kim-rush-2016-sequence'&gt;Yoon Kim and Alexander&amp;nbsp;M. Rush.
Sequence-level knowledge distillation.
In &lt;em&gt;Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing&lt;/em&gt;, 1317–1327. Austin, Texas, November 2016. Association for Computational Linguistics.
URL: &lt;a href="https://www.aclweb.org/anthology/D16-1139"&gt;https://www.aclweb.org/anthology/D16-1139&lt;/a&gt;, &lt;a href="https://doi.org/10.18653/v1/D16-1139"&gt;doi:10.18653/v1/D16-1139&lt;/a&gt;. &lt;a class="cite-backref" href="#ref-kim-rush-2016-sequence-1" title="Jump back to reference 1"&gt;↩&lt;/a&gt;&lt;/p&gt;
&lt;p id='raganato-etal-2019-mucow'&gt;Alessandro Raganato, Yves Scherrer, and J&lt;span class="bibtex-protected"&gt;ö&lt;/span&gt;rg Tiedemann.
The &lt;span class="bibtex-protected"&gt;M&lt;/span&gt;u&lt;span class="bibtex-protected"&gt;C&lt;/span&gt;o&lt;span class="bibtex-protected"&gt;W&lt;/span&gt; test suite at &lt;span class="bibtex-protected"&gt;WMT&lt;/span&gt; 2019: automatically harvested multilingual contrastive word sense disambiguation test sets for machine translation.
In &lt;em&gt;Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1)&lt;/em&gt;, 470–480. Florence, Italy, August 2019. Association for Computational Linguistics.
URL: &lt;a href="https://www.aclweb.org/anthology/W19-5354"&gt;https://www.aclweb.org/anthology/W19-5354&lt;/a&gt;, &lt;a href="https://doi.org/10.18653/v1/W19-5354"&gt;doi:10.18653/v1/W19-5354&lt;/a&gt;. &lt;a class="cite-backref" href="#ref-raganato-etal-2019-mucow-1" title="Jump back to reference 1"&gt;↩&lt;/a&gt;&lt;/p&gt;
&lt;p id='stanovsky-etal-2019-evaluating'&gt;Gabriel Stanovsky, Noah&amp;nbsp;A. Smith, and Luke Zettlemoyer.
Evaluating gender bias in machine translation.
In &lt;em&gt;Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics&lt;/em&gt;, 1679–1684. Florence, Italy, July 2019. Association for Computational Linguistics.
URL: &lt;a href="https://www.aclweb.org/anthology/P19-1164"&gt;https://www.aclweb.org/anthology/P19-1164&lt;/a&gt;, &lt;a href="https://doi.org/10.18653/v1/P19-1164"&gt;doi:10.18653/v1/P19-1164&lt;/a&gt;. &lt;a class="cite-backref" href="#ref-stanovsky-etal-2019-evaluating-1" title="Jump back to reference 1"&gt;↩&lt;/a&gt;&lt;a class="cite-backref" href="#ref-stanovsky-etal-2019-evaluating-1" title="Jump back to reference 1"&gt; &lt;sup&gt;1&lt;/sup&gt; &lt;/a&gt;&lt;a class="cite-backref" href="#ref-stanovsky-etal-2019-evaluating-2" title="Jump back to reference 2"&gt;&lt;sup&gt;2&lt;/sup&gt; &lt;/a&gt;&lt;/p&gt;
&lt;p id='Zhou2020Understanding'&gt;Chunting Zhou, Jiatao Gu, and Graham Neubig.
Understanding knowledge distillation in non-autoregressive machine translation.
In &lt;em&gt;International Conference on Learning Representations&lt;/em&gt;. 2020.
URL: &lt;a href="https://openreview.net/forum?id=BygFVAEKDH"&gt;https://openreview.net/forum?id=BygFVAEKDH&lt;/a&gt;. &lt;a class="cite-backref" href="#ref-Zhou2020Understanding-1" title="Jump back to reference 1"&gt;↩&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;</content><category term="blog"></category></entry><entry><title>Evaluating Black-Box MT with Contrastive Conditioning</title><link href="https://vamvas.ch/evaluating-black-box-mt-with-contrastive-conditioning" rel="alternate"></link><published>2021-08-28T00:00:00+02:00</published><updated>2021-08-28T00:00:00+02:00</updated><author><name>Jannis Vamvas</name></author><id>tag:vamvas.ch,2021-08-28:/evaluating-black-box-mt-with-contrastive-conditioning</id><summary type="html">&lt;p&gt;Why not use contrastive sources instead of contrastive translations?&lt;/p&gt;</summary><content type="html">&lt;p&gt;In a &lt;a href="https://aclanthology.org/2021.emnlp-main.803/"&gt;paper presented at EMNLP 2021&lt;/a&gt;, we propose a new technique for evaluating disambiguation in machine translation called &lt;strong&gt;&lt;em&gt;contrastive conditioning&lt;/em&gt;&lt;/strong&gt;. This post is an introduction to the core ideas behind the technique.&lt;/p&gt;
&lt;h2&gt;White-Box vs. Black-Box MT&lt;/h2&gt;
&lt;p&gt;Machine translation (MT) comes with different user interfaces. Systems created in a research lab are &lt;em&gt;white-box&lt;/em&gt; systems, which means that the code and the model weights are available to the researchers. On the other hand, commercial MT systems such as Google Translate are &lt;em&gt;black boxes&lt;/em&gt; to most users. Independent researchers can only observe what goes in (the source sentence) and what comes out (the machine translation).&lt;/p&gt;
&lt;p&gt;As MT systems are getting better and better, &lt;em&gt;targeted evaluation&lt;/em&gt; has become more important.
MT researchers analyze the performance of systems regarding specific linguistic phenomena, using carefully crafted test data.
Scaling such a targeted evaluation is easier if you can peek into the model internals.&lt;/p&gt;
&lt;h2&gt;Contrastive Evaluation&lt;/h2&gt;
&lt;p&gt;A perfect example is the &lt;em&gt;contrastive evaluation&lt;/em&gt; technique &lt;a class='citation-link' data-toggle='tooltip' data-html='true' title='Rico Sennrich.
How grammatical is character-level neural machine translation? assessing &lt;span class="bibtex-protected"&gt;MT&lt;/span&gt; quality with contrastive translation pairs.
In &lt;em&gt;Proceedings of the 15th Conference of the &lt;span class="bibtex-protected"&gt;E&lt;/span&gt;uropean Chapter of the Association for Computational Linguistics: Volume 2, Short Papers&lt;/em&gt;, 376–382. Valencia, Spain, April 2017. Association for Computational Linguistics.
URL: &lt;a href="https://www.aclweb.org/anthology/E17-2060"&gt;https://www.aclweb.org/anthology/E17-2060&lt;/a&gt;.' href='#sennrich-2017-grammatical' id='ref-sennrich-2017-grammatical-1'&gt;(Sennrich, 2017)&lt;/a&gt;.
There, the idea is to suggest two different translation variants to the model. The first variant is a correct translation and the second one contains the error type you're interested in:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Example for contrastive evaluation of a machine translation system" src="https://vamvas.ch/assets/contrastive-conditioning/contrastive-evaluation.png" width="300px"&gt;
&lt;em&gt;Example for contrastive evaluation of an MT system (English–German). In this post, I am using the error type "wrong disambiguation of &lt;em&gt;Turkey&lt;/em&gt;" as an example, since when translating from English into German, you need to choose between &lt;em&gt;Türkei&lt;/em&gt; (the country) and &lt;em&gt;Truthahn&lt;/em&gt; (the bird). This is just an illustrative example, not an error that a modern MT system is likely to make.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;One can expect that a good model assigns a higher score to the correct translation, like in the example above.
However, such &lt;em&gt;scores&lt;/em&gt; (probability estimates of a translation given a source sentence) can only be computed with white-box systems. You cannot apply contrastive evaluation to Google Translate and similar services.&lt;/p&gt;
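&lt;p&gt;For a white-box system, such a score is easy to compute. Here is a minimal sketch using a recent version of the Hugging Face transformers library and an OPUS-MT English-German model; the model choice, the helper function and the sentences are only for illustration, not the exact setup from the paper:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Sketch: scoring contrastive translation pairs with a white-box MT model
import torch
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-en-de"   # any white-box MT model would do
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name).eval()

def score(source, translation):
    """Length-normalized log-probability of the translation given the source."""
    batch = tokenizer(source, text_target=translation, return_tensors="pt")
    with torch.no_grad():
        loss = model(**batch).loss   # mean negative log-likelihood per target token
    return -loss.item()

source = "This was made in Turkey."
correct = "Dies wurde in der Türkei hergestellt."
contrastive = "Dies wurde in Truthahn hergestellt."
print(score(source, correct), score(source, contrastive))
# A good model should assign the higher score to the correct translation.
&lt;/code&gt;&lt;/pre&gt;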
&lt;h2&gt;Pattern-Matching Evaluation&lt;/h2&gt;
&lt;p&gt;To also subject black-box MT to targeted evaluation, researchers have come up with pattern-matching approaches that automatically analyze the translation output.
For example, &lt;a class='citation-link' data-toggle='tooltip' data-html='true' title='Alessandro Raganato, Yves Scherrer, and J&lt;span class="bibtex-protected"&gt;ö&lt;/span&gt;rg Tiedemann.
An evaluation benchmark for testing the word sense disambiguation capabilities of machine translation systems.
In &lt;em&gt;Proceedings of the 12th Language Resources and Evaluation Conference&lt;/em&gt;, 3668–3675. Marseille, France, May 2020. European Language Resources Association.
URL: &lt;a href="https://www.aclweb.org/anthology/2020.lrec-1.452"&gt;https://www.aclweb.org/anthology/2020.lrec-1.452&lt;/a&gt;.' href='#raganato-etal-2020-evaluation' id='ref-raganato-etal-2020-evaluation-1'&gt;Raganato et al. (2020)&lt;/a&gt; search the translation for different BabelNet senses, which in our example allows them to check whether &lt;em&gt;Turkey&lt;/em&gt; is translated correctly into German:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Example for pattern-matching evaluation of a machine translation system" src="https://vamvas.ch/assets/contrastive-conditioning/pattern-matching.png" width="350px"&gt;
&lt;em&gt;Example for a pattern-matching evaluation (English–German)&lt;/em&gt;&lt;/p&gt;
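&lt;p&gt;In code, the core of such an approach boils down to looking up sense-specific patterns in the output. The patterns below are purely illustrative and much simpler than the BabelNet-based sense sets used by Raganato et al. (2020):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Sketch: pattern-matching evaluation with hand-picked sense patterns
COUNTRY_PATTERNS = ["türkei"]
BIRD_PATTERNS = ["truthahn", "pute"]

def judge(translation):
    text = translation.lower()
    if any(p in text for p in COUNTRY_PATTERNS):
        return "correct"      # country sense found
    if any(p in text for p in BIRD_PATTERNS):
        return "error"        # bird sense found
    return "unmatched"        # neither pattern found; the sample cannot be scored

print(judge("Dies wurde in der Türkei hergestellt."))          # correct
print(judge("Dies wurde in gefrorener Pute hergestellt."))     # error
print(judge("Das Produkt stammt aus dem Land am Bosporus."))   # unmatched
&lt;/code&gt;&lt;/pre&gt;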
&lt;p&gt;However, there are always going to be a few translations (let's say 20%) that do not match any of the patterns.
Natural language has many ways of expressing a concept, and it is difficult to enumerate them all. 
Another drawback, which also applies to contrastive evaluation, is that you need to prepare data for every target language of interest.&lt;/p&gt;
&lt;p&gt;What we find exciting about the new method is that it promises, at least for phenomena such as lexical disambiguation, to combine the best of both worlds: The recall of contrastive evaluation and the black-box accessibility of pattern-matching.&lt;/p&gt;
&lt;h2&gt;Inspiration for Contrastive Conditioning&lt;/h2&gt;
&lt;p&gt;The basic idea is that by varying the source sequence to provide more disambiguation cues, you can learn about the ground truth without speaking the target language. For example, let's begin with our original, slightly ambiguous &lt;em&gt;Turkey&lt;/em&gt; sentence:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Screenshot from Google Translate (&amp;quot;This was made in Turkey&amp;quot; -&amp;gt; &amp;quot;Dies wurde in der Türkei hergestellt&amp;quot;)" class="box-shadow" src="https://vamvas.ch/assets/contrastive-conditioning/contrastive-conditioning-original.png" width="300px,"&gt;&lt;/p&gt;
&lt;p&gt;If you replace &lt;em&gt;Turkey&lt;/em&gt; with &lt;em&gt;modern Turkey&lt;/em&gt;, you give Google Translate an additional cue that &lt;em&gt;Turkey&lt;/em&gt; is supposed to be a country, not a bird. Given such a cue, Google Translate is less likely to make a disambiguation error:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Screenshot from Google Translate (&amp;quot;This was made in modern Turkey&amp;quot; -&amp;gt; &amp;quot;Dies wurde in der modernen Türkei gemacht&amp;quot;)" class="box-shadow" src="https://vamvas.ch/assets/contrastive-conditioning/contrastive-conditioning-correct.png" width="300px,"&gt;&lt;/p&gt;
&lt;p&gt;Since Google Translate has not stopped using the German word &lt;em&gt;Türkei&lt;/em&gt;, the original sentence seems to have been translated correctly.
You can verify this by replacing &lt;em&gt;modern Turkey&lt;/em&gt; with &lt;em&gt;frozen Turkey&lt;/em&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Screenshot from Google Translate (&amp;quot;This was made in frozen Turkey&amp;quot; -&amp;gt; &amp;quot;Dies wurde in gefrorener Truthahn hergestellt&amp;quot;)" class="box-shadow" src="https://vamvas.ch/assets/contrastive-conditioning/contrastive-conditioning-incorrect.png" width="300px,"&gt;&lt;/p&gt;
&lt;p&gt;The word &lt;em&gt;Türkei&lt;/em&gt; has now disappeared from the translation, which is further evidence that the system uses &lt;em&gt;Türkei&lt;/em&gt; for the country sense and not for the bird.&lt;/p&gt;
&lt;h2&gt;Formalization&lt;/h2&gt;
&lt;p&gt;Countless power users of MT have probably already come up with this algorithm. 
The challenge is to apply it to targeted evaluation in a systematic and formalized way.&lt;/p&gt;
&lt;p&gt;Our approach is based on scoring, like in contrastive evaluation. However, what we score is not a human-crafted translation, but the translation by the MT system.
We compute two scores for the translation: one score based on a correct disambiguation cue, and another score based on an incorrect disambiguation cue.
If the first score is higher than the second score, the translation is probably correct:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Example for contrastive conditioning" src="https://vamvas.ch/assets/contrastive-conditioning/contrastive-conditioning-example.png" width="450px"&gt;&lt;/p&gt;
&lt;p&gt;To make things clearer, let's compare this to the standard approach of contrastive evaluation:
In the standard approach, two contrastive variants of the translation are scored, conditioned on the same source sequence.
In our approach (&lt;em&gt;contrastive conditioning&lt;/em&gt;), the same translation is scored twice, conditioned on contrastive variants of the source.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Schematic visualization of the contrastive conditioning algorithm" src="https://vamvas.ch/assets/contrastive-conditioning/contrastive-conditioning-schema.png"&gt;&lt;/p&gt;
&lt;p&gt;That may not seem like a big difference. But crucially, our method does not depend on the &lt;span style="background:#D4D9ED"&gt;evaluated&amp;nbsp;model&lt;/span&gt; to produce the scores.
It only requires a translation output, and then any MT system can play the role of an &lt;span style="background:#FFD5D5"&gt;evaluator&amp;nbsp;model&lt;/span&gt;.
If you are evaluating a black box, you need an additional white-box MT system for that (for example from the &lt;a href="https://opus.nlpl.eu/Opus-MT/"&gt;OPUS-MT&lt;/a&gt; project).&lt;/p&gt;
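&lt;p&gt;Here is the decision rule as a minimal sketch, reusing the &lt;code&gt;score()&lt;/code&gt; helper and the OPUS-MT evaluator from the snippet further up. The cue sentences are the &lt;em&gt;modern/frozen Turkey&lt;/em&gt; examples from this post; in practice the translation would come from the black-box system under evaluation:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Sketch: contrastive conditioning with a white-box evaluator model
translation = "Dies wurde in der Türkei hergestellt."   # output of the black-box system

correct_cue = "This was made in modern Turkey."     # cue for the country sense
incorrect_cue = "This was made in frozen Turkey."   # cue for the bird sense

margin = score(correct_cue, translation) - score(incorrect_cue, translation)
# A positive margin means the correct cue explains the translation better,
# so the black-box translation is judged to be correctly disambiguated.
print(round(margin, 3))
&lt;/code&gt;&lt;/pre&gt;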
&lt;h2&gt;Other Benefits&lt;/h2&gt;
&lt;p&gt;In addition to its black-box nature, contrastive conditioning has a few other interesting properties.
For example, since the ratio between the two scores can be seen as the &lt;em&gt;confidence&lt;/em&gt; of the evaluator, it is possible to weight the test samples by this confidence.
This is helpful if some test samples are difficult to judge.&lt;/p&gt;
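&lt;p&gt;One plausible way to turn the score ratio into such a weight is a sigmoid of the log-score difference. This is only an illustration of the idea, not necessarily the exact weighting we use in the paper:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Sketch: confidence weighting based on the evaluator's two scores
import math

def confidence(score_correct_cue, score_incorrect_cue):
    """Weight between 0 and 1; 0.5 means the evaluator cannot tell the cues apart."""
    return 1.0 / (1.0 + math.exp(score_incorrect_cue - score_correct_cue))

print(confidence(-10.2, -12.7))   # roughly 0.92: the evaluator is fairly confident
&lt;/code&gt;&lt;/pre&gt;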
&lt;p&gt;Another advantage is that the test data can be reference-free. All the contrastive modifications happen in the source language.
In that sense, contrastive conditioning is easy to scale across many target languages.&lt;/p&gt;
&lt;h2&gt;Outlook&lt;/h2&gt;
&lt;p&gt;It is an open question how many linguistic phenomena are amenable to contrastive conditioning.
In &lt;a href="https://aclanthology.org/2021.emnlp-main.803/"&gt;our paper&lt;/a&gt;, we perform a successful case study on two well-known challenges for MT: Word sense disambiguation and the translation of gendered occupation names into morphologically rich languages.
I am going to discuss our findings in a &lt;a href="/when-mt-distillation-leads-to-bias"&gt;follow-up post&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;It will be interesting to see whether phenomena beyond disambiguation can also be evaluated using contrastive conditioning.
And in my view, disambiguation remains a great challenge for MT, even though few modern systems are as bad as the one that produced that infamous clothing label:&lt;/p&gt;
&lt;blockquote class="twitter-tweet"&gt;&lt;p lang="en" dir="ltr"&gt;🤷‍♂️ German is hard... &lt;a href="https://t.co/Qe81wLhAnQ"&gt;pic.twitter.com/Qe81wLhAnQ&lt;/a&gt;&lt;/p&gt;&amp;mdash; Benedikt Meurer (&amp;#64;bmeurer) &lt;a href="https://twitter.com/bmeurer/status/1107978587657986053?ref_src=twsrc%5Etfw"&gt;March 19, 2019&lt;/a&gt;&lt;/blockquote&gt;
&lt;script async src="https://platform.twitter.com/widgets.js" charset="utf-8"&gt;&lt;/script&gt;

&lt;p&gt;&lt;br&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The work presented in this post is joint work with my PhD supervisor &lt;a href="https://www.cl.uzh.ch/de/people/team/compling/sennrich.html"&gt;Rico Sennrich&lt;/a&gt;. It was funded by the Swiss National Science Foundation (project MUTAMUR; no. 176727).&lt;/em&gt;&lt;/p&gt;&lt;div id="bibliography" class="appendix"&gt;&lt;h2&gt;References&lt;/h2&gt;&lt;p id='raganato-etal-2020-evaluation'&gt;Alessandro Raganato, Yves Scherrer, and J&lt;span class="bibtex-protected"&gt;ö&lt;/span&gt;rg Tiedemann.
An evaluation benchmark for testing the word sense disambiguation capabilities of machine translation systems.
In &lt;em&gt;Proceedings of the 12th Language Resources and Evaluation Conference&lt;/em&gt;, 3668–3675. Marseille, France, May 2020. European Language Resources Association.
URL: &lt;a href="https://www.aclweb.org/anthology/2020.lrec-1.452"&gt;https://www.aclweb.org/anthology/2020.lrec-1.452&lt;/a&gt;. &lt;a class="cite-backref" href="#ref-raganato-etal-2020-evaluation-1" title="Jump back to reference 1"&gt;↩&lt;/a&gt;&lt;/p&gt;
&lt;p id='sennrich-2017-grammatical'&gt;Rico Sennrich.
How grammatical is character-level neural machine translation? assessing &lt;span class="bibtex-protected"&gt;MT&lt;/span&gt; quality with contrastive translation pairs.
In &lt;em&gt;Proceedings of the 15th Conference of the &lt;span class="bibtex-protected"&gt;E&lt;/span&gt;uropean Chapter of the Association for Computational Linguistics: Volume 2, Short Papers&lt;/em&gt;, 376–382. Valencia, Spain, April 2017. Association for Computational Linguistics.
URL: &lt;a href="https://www.aclweb.org/anthology/E17-2060"&gt;https://www.aclweb.org/anthology/E17-2060&lt;/a&gt;. &lt;a class="cite-backref" href="#ref-sennrich-2017-grammatical-1" title="Jump back to reference 1"&gt;↩&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;</content><category term="blog"></category></entry><entry><title>More General Stance Detection with x-Stance</title><link href="https://vamvas.ch/more-general-stance-detection-with-x-stance" rel="alternate"></link><published>2020-06-11T00:00:00+02:00</published><updated>2020-06-11T00:00:00+02:00</updated><author><name>Jannis Vamvas</name></author><id>tag:vamvas.ch,2020-06-11:/more-general-stance-detection-with-x-stance</id><summary type="html">&lt;p&gt;Introducing a dataset for multilingual and multi-target stance detection.&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;em&gt;Update (2020-08-01)&lt;/em&gt;: A video of the conference presentation is now available.&lt;/p&gt;
&lt;iframe width="560" height="315" src="https://www.youtube-nocookie.com/embed/kVWkXnR4_Eg?cc_load_policy=1&amp;modestbranding=1" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen&gt;&lt;/iframe&gt;

&lt;hr&gt;
&lt;p&gt;Automated stance detection systems try to detect broad opinions in natural language expressions. This post is an introduction to a new resource for stance detection called &lt;nobr&gt;&lt;a href="https://github.com/ZurichNLP/xstance"&gt;&lt;span style="font-variant:small-caps;"&gt;x&lt;/span&gt;-Stance&lt;/a&gt;&lt;/nobr&gt;.&lt;/p&gt;
&lt;h2&gt;Background: Stance Detection&lt;/h2&gt;
&lt;p&gt;The idea behind stance detection is best explained in an example. This tweet has recently earned &lt;a href="https://en.wikipedia.org/w/index.php?title=List_of_most-liked_tweets&amp;amp;oldid=961828812"&gt;more than 2 million likes&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote class="twitter-tweet" data-lang="en" data-dnt="true" data-theme="light"&gt;&lt;p lang="en" dir="ltr"&gt;After stoking the fires of white supremacy and racism your entire presidency, you have the nerve to feign moral superiority before threatening violence? ‘When the looting starts the shooting starts’??? We will vote you out in November. &lt;a href="https://twitter.com/realDonaldTrump?ref_src=twsrc%5Etfw"&gt;&amp;#64;realdonaldtrump&lt;/a&gt;&lt;/p&gt;&amp;mdash; Taylor Swift (&amp;#64;taylorswift13) &lt;a href="https://twitter.com/taylorswift13/status/1266392274549776387?ref_src=twsrc%5Etfw"&gt;May 29, 2020&lt;/a&gt;&lt;/blockquote&gt;
&lt;script async src="https://platform.twitter.com/widgets.js" charset="utf-8"&gt;&lt;/script&gt;

&lt;p&gt;Of course, the tweet does not just constitute a neutral election forecast. It also expresses a &lt;em&gt;negative stance&lt;/em&gt; towards Donald Trump.&lt;/p&gt;
&lt;p&gt;The type of stance detection system we are interested in is a system that takes an &lt;em&gt;input&amp;nbsp;text&lt;/em&gt; and a &lt;em&gt;target&lt;/em&gt; and outputs either &lt;em&gt;favor&lt;/em&gt;, &lt;em&gt;against&lt;/em&gt; or &lt;em&gt;neutral&lt;/em&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Program flow of a typical stance detection system" src="https://vamvas.ch/assets/xstance/stance-detection-program-flow.png" width="400px"&gt;&lt;/p&gt;
&lt;p&gt;Stance detection systems have been made possible by a series of annotated datasets – from a &lt;a href="http://www.saifmohammad.com/WebPages/StanceDataset.htm"&gt;collection of English tweets on Donald Trump and other topics&lt;/a&gt; all the way to a &lt;a href="https://github.com/mirkolai/MultilingualStanceDetection"&gt;collection of Italian tweets on the Italian constitution&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;However, it has been unclear if the systems can generalize well beyond those specific settings. For example, an English system is not necessarily applicable to Italian or French tweets. And if a system trained on the target of Donald Trump does not generalize to future presidents, then the data collection effort might not be very sustainable.&lt;/p&gt;
&lt;p&gt;&lt;img alt="he generalization problem in stance detection" src="https://vamvas.ch/assets/xstance/generalization-problem.png" width="450px"&gt;
&lt;em&gt;The generalization problem in stance detection. The two dimensions of transfer have been studied individually (&lt;a class='citation-link' data-toggle='tooltip' data-html='true' title='Saif Mohammad, Svetlana Kiritchenko, Parinaz Sobhani, Xiaodan Zhu, and Colin Cherry.
&lt;span class="bibtex-protected"&gt;S&lt;/span&gt;em&lt;span class="bibtex-protected"&gt;E&lt;/span&gt;val-2016 task 6: detecting stance in tweets.
In &lt;em&gt;Proceedings of the 10th International Workshop on Semantic Evaluation (&lt;span class="bibtex-protected"&gt;S&lt;/span&gt;em&lt;span class="bibtex-protected"&gt;E&lt;/span&gt;val-2016)&lt;/em&gt;, 31–41. San Diego, California, June 2016. Association for Computational Linguistics.
URL: &lt;a href="https://www.aclweb.org/anthology/S16-1003"&gt;https://www.aclweb.org/anthology/S16-1003&lt;/a&gt;, &lt;a href="https://doi.org/10.18653/v1/S16-1003"&gt;doi:10.18653/v1/S16-1003&lt;/a&gt;.' href='#mohammad-etal-2016-semeval' id='ref-mohammad-etal-2016-semeval-1'&gt;Mohammad et al. (2016)&lt;/a&gt;, &lt;a class='citation-link' data-toggle='tooltip' data-html='true' title='Mariona Taul&lt;span class="bibtex-protected"&gt;é&lt;/span&gt;, M&amp;nbsp;Ant&lt;span class="bibtex-protected"&gt;ò&lt;/span&gt;nia Mart&lt;span class="bibtex-protected"&gt;í&lt;/span&gt;, Francisco Rangel, Paolo Rosso, Cristina Bosco, and Viviana Patti.
Overview of the task on stance and gender detection in tweets on catalan independence at ibereval 2017.
In &lt;em&gt;2nd Workshop on Evaluation of Human Language Technologies for Iberian Languages, IberEval 2017&lt;/em&gt;, volume 1881, 157–177. 2017.
URL: &lt;a href="http://ceur-ws.org/Vol-1881/Overview5.pdf"&gt;http://ceur-ws.org/Vol-1881/Overview5.pdf&lt;/a&gt;.' href='#taule2017overview' id='ref-taule2017overview-1'&gt;Taule et al. (2017)&lt;/a&gt;, &lt;a class='citation-link' data-toggle='tooltip' data-html='true' title='Mariona Taul&lt;span class="bibtex-protected"&gt;é&lt;/span&gt;, Francisco Rangel, M&amp;nbsp;Ant&lt;span class="bibtex-protected"&gt;ò&lt;/span&gt;nia Mart&lt;span class="bibtex-protected"&gt;í&lt;/span&gt;, and Paolo Rosso.
Overview of the task on multimodal stance detection in tweets on catalan #1oct referendum.
In &lt;em&gt;3rd Workshop on Evaluation of Human Language Technologies for Iberian Languages, IberEval 2018&lt;/em&gt;, volume 2150, 149–166. 2018.
URL: &lt;a href="http://ceur-ws.org/Vol-2150/overview-Multistance18.pdf"&gt;http://ceur-ws.org/Vol-2150/overview-Multistance18.pdf&lt;/a&gt;.' href='#taule2018overview' id='ref-taule2018overview-1'&gt;Taule et al. (2018)&lt;/a&gt;), but not jointly.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;In &lt;a href="https://arxiv.org/abs/2003.08385"&gt;&lt;span style="font-variant:small-caps;"&gt;x&lt;/span&gt;-Stance: A Multilingual Multi-Target Dataset for Stance Detection&lt;/a&gt;, we present a large-scale dataset in 3 languages and on more than 150 political issues. We show that &lt;span style="font-variant:small-caps;"&gt;x&lt;/span&gt;-Stance can be used to train a single model on all of those issues.&lt;/p&gt;
&lt;p&gt;We also look at generalization performance: Our models are evaluated both in a held-out language and on held-out targets. We find that if a standard text classification model is used, zero-shot cross-lingual and cross-target transfer is moderately successful.&lt;/p&gt;
&lt;h2&gt;How We Created &lt;span style="font-variant:small-caps;"&gt;x&lt;/span&gt;-Stance&lt;/h2&gt;
&lt;p&gt;&lt;span style="font-variant:small-caps;"&gt;x&lt;/span&gt;-Stance contains 67,000 samples, which is an order of magnitude more than what has been common in stance detection. The dataset is relatively large because instead of doing manual annotation, we have extracted the data directly from a political website. The website – &lt;a href="https://smartvote.ch/"&gt;smartvote.ch&lt;/a&gt; – is a voting advice application that is highly popular in Switzerland.&lt;/p&gt;
&lt;p&gt;Electoral candidates who participate in such a voting advice application are asked a range of questions on controversial topics. &lt;em&gt;Should cannabis use be legalized? Should Switzerland strive for a free trade agreement with the United States?&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;What makes this website so interesting for stance detection is that the candidates can respond in two ways: On a yes/no scale and in a free-text comment:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Example of a candidate's response on Smartvote" src="https://vamvas.ch/assets/xstance/smartvote-example-annotated.png" width="600px"&gt;
&lt;em&gt;A part of a candidate's response on &lt;a href="https://smartvote.ch/"&gt;smartvote.ch&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;While it is not mandatory, candidates often like to write a few sentences in order to justify, explain or differentiate the yes/no answer in their own language. If we now reverse this relation and interpret the yes/no answer as an &lt;em&gt;annotation of the comments&lt;/em&gt;, we obtain a supervised dataset for stance detection:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Dataset structure of X-Stance" src="https://vamvas.ch/assets/xstance/xstance-structure.png" width="600px"&gt;&lt;/p&gt;
&lt;p&gt;Such an automatically extracted dataset is likely noisier than manually curated datasets, but we still expect it to be useful for machine learning research. And thanks to the Smartvote team, we can &lt;a href="https://github.com/ZurichNLP/xstance"&gt;make it available&lt;/a&gt; to fellow researchers under the CC BY-NC 4.0 license.&lt;/p&gt;
&lt;p&gt;Apart from stance detection as a supervised task, we believe that &lt;span style="font-variant:small-caps;"&gt;x&lt;/span&gt;-Stance is also a valuable resource for the study of transfer learning.&lt;/p&gt;
&lt;h2&gt;Putting the «&lt;span style="font-variant:small-caps;"&gt;x&lt;/span&gt;» in &lt;nobr&gt;&lt;span style="font-variant:small-caps;"&gt;x&lt;/span&gt;-Stance&lt;/nobr&gt;: Transfer Learning&lt;/h2&gt;
&lt;p&gt;First of all, &lt;span style="font-variant:small-caps;"&gt;x&lt;/span&gt;-Stance is a multilingual dataset: Since different languages are spoken in different parts of Switzerland, candidates have been free to answer in any language, be it German, French or Italian. (To our regret, we did not encounter any &lt;a href="https://www.thelocal.ch/20170302/18-interesting-facts-about-switzerlands-fourth-language-romansh"&gt;Romansh&lt;/a&gt; comments.)&lt;/p&gt;
&lt;p&gt;&lt;img alt="Languages in the X-Stance dataset" src="https://vamvas.ch/assets/xstance/dataset-languages.png" width="600px"&gt;&lt;/p&gt;
&lt;p&gt;Secondly, the &lt;span style="font-variant:small-caps;"&gt;x&lt;/span&gt;-Stance dataset contains questions on diverse policy issues. We have clustered the questions into 12 broad topics, including 2 held-out topics:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Topics in the X-Stance dataset" src="https://vamvas.ch/assets/xstance/dataset-topics.png" width="600px"&gt;&lt;/p&gt;
&lt;p&gt;We can now test both for cross-lingual transfer from German and French to Italian, and for cross-target transfer from known topics to previously unseen topics such as healthcare:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Dimensions of generalization in X-Stance" src="https://vamvas.ch/assets/xstance/xstance-generalization.png" width="450px"&gt;&lt;/p&gt;
&lt;p&gt;In our paper we use a standard architecture – &lt;a href="https://github.com/google-research/bert/"&gt;BERT&lt;/a&gt; – to demonstrate how &lt;span style="font-variant:small-caps;"&gt;x&lt;/span&gt;-Stance can serve as a benchmark for transfer learning.&lt;/p&gt;
&lt;h2&gt;Adapting BERT to &lt;span style="font-variant:small-caps;"&gt;x&lt;/span&gt;-Stance&lt;/h2&gt;
&lt;p&gt;We download the &lt;a href="https://github.com/google-research/bert/blob/master/multilingual.md"&gt;multilingual BERT model&lt;/a&gt;, which has been pre-trained by Google Research in 104 languages. We then fine-tune the model on our German and French training data.&lt;/p&gt;
&lt;p&gt;As we want to train the model jointly on multiple targets (&lt;em&gt;Free Trade, Legality of Cannabis, …&lt;/em&gt;), we need to inform the model about the specific target of every instance. For this, we concatenate each comment with the corresponding natural-language question from Smartvote.&lt;/p&gt;
&lt;p&gt;As BERT allows for two input segments, we designate the question as segment A and the comment as segment B. We put a linear classifier on top of BERT and train the model to predict either FAVOR or AGAINST given a question–comment pair:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Input and output of a BERT sequence pair classifier." src="https://vamvas.ch/assets/xstance/bert-for-xstance.png" width="400px"&gt;
&lt;em&gt;Input and output of a BERT sequence pair classifier. Original image by &lt;a class='citation-link' data-toggle='tooltip' data-html='true' title='Mikel Artetxe, Sebastian Ruder, and Dani Yogatama.
On the cross-lingual transferability of monolingual representations.
&lt;em&gt;arXiv preprint arXiv:1910.11856&lt;/em&gt;, 2019.' href='#artetxe2019cross' id='ref-artetxe2019cross-1'&gt;Artetxe et al. (2019)&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
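&lt;p&gt;A minimal sketch of this setup with the Hugging Face transformers library looks as follows. The strings and the label mapping are illustrative, and the training loop on the &lt;span style="font-variant:small-caps;"&gt;x&lt;/span&gt;-Stance data is omitted:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Sketch: question-comment pair classification with multilingual BERT
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=2)   # e.g. 0 = FAVOR, 1 = AGAINST

question = "Should cannabis use be legalized?"
comment = "Prohibition has failed; regulation would protect consumers better."

# The question becomes segment A, the comment segment B.
inputs = tokenizer(question, comment, truncation=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.softmax(dim=-1))   # untrained classifier, so the output is arbitrary
&lt;/code&gt;&lt;/pre&gt;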
&lt;p&gt;In a supervised setting, which involves previously seen targets and languages, we find that BERT can clearly surpass a simple majority-class baseline:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Cross-lingual results chart" src="https://vamvas.ch/assets/xstance/cross-lingual-results.png" width="600px"&gt;&lt;/p&gt;
&lt;p&gt;In the cross-lingual setting, the zero-shot performance in Italian is much better than the baseline. But the performance is higher in German and French, because the model has been trained on samples in those languages.&lt;/p&gt;
&lt;p&gt;Finally, the model can also generalize to held-out topics. If the model is asked to detect the stance of a text towards a previously unseen target, it performs better than a global majority-class baseline:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Cross-target results chart" src="https://vamvas.ch/assets/xstance/cross-target-results.png" width="600px"&gt;&lt;/p&gt;
&lt;h2&gt;Bringing it All Together&lt;/h2&gt;
&lt;p&gt;Given the &lt;span style="font-variant:small-caps;"&gt;x&lt;/span&gt;-Stance dataset for stance detection, a multilingual BERT model has some capability to perform zero-shot transfer to unseen languages and to unseen targets. However, there is a gap in performance between the supervised settings and the zero-shot settings that future work could address. For example, even better representations or a more sophisticated classification architecture could be used.&lt;/p&gt;
&lt;h2&gt;Learn More about &lt;span style="font-variant:small-caps;"&gt;x&lt;/span&gt;-Stance&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Our &lt;a href="https://github.com/ZurichNLP/xstance"&gt;GitHub repository&lt;/a&gt; contains the full dataset as well as the evaluation script and the code for our baselines.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href="https://arxiv.org/abs/2003.08385"&gt;The &lt;span style="font-variant:small-caps;"&gt;x&lt;/span&gt;-Stance paper&lt;/a&gt; has been presented at the &lt;a href="https://swisstext-and-konvens-2020.org/"&gt;5th SwissText &amp;amp; 16th KONVENS Conference 2020&lt;/a&gt;. In the paper you will find more details, and also references to related work, which have mostly been omitted in this blog post.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;span style="font-variant:small-caps;"&gt;x&lt;/span&gt;-Stance is also part of the &lt;a href="https://github.com/huggingface/datasets"&gt;datasets&lt;/a&gt; library from Huggingface. Use the &lt;a href="https://huggingface.co/nlp/viewer/?dataset=x_stance"&gt;live viewer&lt;/a&gt; to have an interactive look at the dataset.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
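&lt;p&gt;Here is the loading sketch mentioned above. It assumes the dataset identifier &lt;code&gt;x_stance&lt;/code&gt; used by the live viewer; field names may change over time, so it simply prints one raw record:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Sketch: loading x-Stance via the Hugging Face datasets library
from datasets import load_dataset

dataset = load_dataset("x_stance")
print(dataset["train"][0])   # one question-comment pair together with its stance label
&lt;/code&gt;&lt;/pre&gt;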
&lt;p&gt;&lt;br&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The work presented in this post is joint work with my PhD supervisor &lt;a href="https://www.cl.uzh.ch/de/people/team/compling/sennrich.html"&gt;Rico Sennrich&lt;/a&gt;. It was funded by the Swiss National Science Foundation (project MUTAMUR; no. 176727).&lt;/em&gt;&lt;/p&gt;&lt;div id="bibliography" class="appendix"&gt;&lt;h2&gt;References&lt;/h2&gt;&lt;p id='artetxe2019cross'&gt;Mikel Artetxe, Sebastian Ruder, and Dani Yogatama.
On the cross-lingual transferability of monolingual representations.
&lt;em&gt;arXiv preprint arXiv:1910.11856&lt;/em&gt;, 2019. &lt;a class="cite-backref" href="#ref-artetxe2019cross-1" title="Jump back to reference 1"&gt;↩&lt;/a&gt;&lt;/p&gt;
&lt;p id='mohammad-etal-2016-semeval'&gt;Saif Mohammad, Svetlana Kiritchenko, Parinaz Sobhani, Xiaodan Zhu, and Colin Cherry.
&lt;span class="bibtex-protected"&gt;S&lt;/span&gt;em&lt;span class="bibtex-protected"&gt;E&lt;/span&gt;val-2016 task 6: detecting stance in tweets.
In &lt;em&gt;Proceedings of the 10th International Workshop on Semantic Evaluation (&lt;span class="bibtex-protected"&gt;S&lt;/span&gt;em&lt;span class="bibtex-protected"&gt;E&lt;/span&gt;val-2016)&lt;/em&gt;, 31–41. San Diego, California, June 2016. Association for Computational Linguistics.
URL: &lt;a href="https://www.aclweb.org/anthology/S16-1003"&gt;https://www.aclweb.org/anthology/S16-1003&lt;/a&gt;, &lt;a href="https://doi.org/10.18653/v1/S16-1003"&gt;doi:10.18653/v1/S16-1003&lt;/a&gt;. &lt;a class="cite-backref" href="#ref-mohammad-etal-2016-semeval-1" title="Jump back to reference 1"&gt;↩&lt;/a&gt;&lt;/p&gt;
&lt;p id='taule2017overview'&gt;Mariona Taul&lt;span class="bibtex-protected"&gt;é&lt;/span&gt;, M&amp;nbsp;Ant&lt;span class="bibtex-protected"&gt;ò&lt;/span&gt;nia Mart&lt;span class="bibtex-protected"&gt;í&lt;/span&gt;, Francisco Rangel, Paolo Rosso, Cristina Bosco, and Viviana Patti.
Overview of the task on stance and gender detection in tweets on catalan independence at ibereval 2017.
In &lt;em&gt;2nd Workshop on Evaluation of Human Language Technologies for Iberian Languages, IberEval 2017&lt;/em&gt;, volume 1881, 157–177. 2017.
URL: &lt;a href="http://ceur-ws.org/Vol-1881/Overview5.pdf"&gt;http://ceur-ws.org/Vol-1881/Overview5.pdf&lt;/a&gt;. &lt;a class="cite-backref" href="#ref-taule2017overview-1" title="Jump back to reference 1"&gt;↩&lt;/a&gt;&lt;/p&gt;
&lt;p id='taule2018overview'&gt;Mariona Taul&lt;span class="bibtex-protected"&gt;é&lt;/span&gt;, Francisco Rangel, M&amp;nbsp;Ant&lt;span class="bibtex-protected"&gt;ò&lt;/span&gt;nia Mart&lt;span class="bibtex-protected"&gt;í&lt;/span&gt;, and Paolo Rosso.
Overview of the task on multimodal stance detection in tweets on catalan #1oct referendum.
In &lt;em&gt;3rd Workshop on Evaluation of Human Language Technologies for Iberian Languages, IberEval 2018&lt;/em&gt;, volume 2150, 149–166. 2018.
URL: &lt;a href="http://ceur-ws.org/Vol-2150/overview-Multistance18.pdf"&gt;http://ceur-ws.org/Vol-2150/overview-Multistance18.pdf&lt;/a&gt;. &lt;a class="cite-backref" href="#ref-taule2018overview-1" title="Jump back to reference 1"&gt;↩&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;</content><category term="blog"></category></entry><entry><title>BERT for NER</title><link href="https://vamvas.ch/bert-for-ner" rel="alternate"></link><published>2019-06-19T00:00:00+02:00</published><updated>2019-06-19T00:00:00+02:00</updated><author><name>Jannis Vamvas</name></author><id>tag:vamvas.ch,2019-06-19:/bert-for-ner</id><summary type="html">&lt;p&gt;How to apply BERT to the task of named entity recognition.&lt;/p&gt;</summary><content type="html">&lt;p&gt;BERT models, when fine-tuned on &lt;strong&gt;Named Entity Recognition (NER)&lt;/strong&gt;, can have a very competitive performance for the English language. This is an overview of how BERT is designed and how it can be applied to the task of NER. In the last section, I will discuss a cross-lingual scenario.&lt;/p&gt;
&lt;p&gt;In this post, I will assume a basic familiarity with the NER task. When I talk about implementation details of BERT &lt;a class='citation-link' data-toggle='tooltip' data-html='true' title='Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova.
Bert: pre-training of deep bidirectional transformers for language understanding.
In &lt;em&gt;Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)&lt;/em&gt;, 4171–4186. 2019.' href='#devlin2019bert' id='ref-devlin2019bert-1'&gt;(Devlin et al., 2019)&lt;/a&gt;, I am referring to the &lt;a href="https://github.com/huggingface/pytorch-pretrained-BERT"&gt;PyTorch version&lt;/a&gt; that was open-sourced by Hugging Face. I have not checked if it completely matches the original implementation with respect to those details.&lt;/p&gt;
&lt;p&gt;First let us look at what goes into the BERT model, because it is rather important for getting NER right.&lt;/p&gt;
&lt;h2&gt;Preprocessing&lt;/h2&gt;
&lt;p&gt;The input to BERT is preprocessed using &lt;strong&gt;WordPiece&lt;/strong&gt; tokenization &lt;a class='citation-link' data-toggle='tooltip' data-html='true' title='Melvin Johnson, Mike Schuster, Quoc&amp;nbsp;V Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Vi&lt;span class="bibtex-protected"&gt;é&lt;/span&gt;gas, Martin Wattenberg, Greg Corrado, and others.
Google’s multilingual neural machine translation system: enabling zero-shot translation.
&lt;em&gt;Transactions of the Association for Computational Linguistics&lt;/em&gt;, 5:339–351, 2017.' href='#johnson2017google' id='ref-johnson2017google-1'&gt;(Johnson et al., 2017)&lt;/a&gt;, which is a technique comparable to &lt;strong&gt;Byte Pair Encoding&lt;/strong&gt; &lt;a class='citation-link' data-toggle='tooltip' data-html='true' title='Rico Sennrich, Barry Haddow, and Alexandra Birch.
Neural machine translation of rare words with subword units.
In &lt;em&gt;Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)&lt;/em&gt;, 1715–1725. 2016.' href='#sennrich2016neural' id='ref-sennrich2016neural-1'&gt;(Sennrich et al., 2016)&lt;/a&gt;. The vocabulary is trained on the pre-training data, then re-used for the fine-tuning without any modifications. For the English language model, a vocabulary of 30k tokens is used, and for the multilingual model, 110k tokens.&lt;/p&gt;
&lt;p&gt;It is important to note that NER datasets like &lt;strong&gt;CoNLL-2003&lt;/strong&gt; are already tokenized, as the gold annotation is provided token by token. For training and evaluating BERT, the WordPiece tokenizer strictly refines the preexisting tokenization: it only splits existing tokens further and never merges them. For every original token, the WordPiece tokenization can take one of two forms:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;One-to-one tokenization.&lt;/strong&gt; The token is in the vocabulary. In this case, it is represented by a single WordPiece.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;One-to-many tokenization&lt;/strong&gt;. The token is not in the vocabulary; in this case, the WordPiece tokenizer will split the token into a sequence of vocabulary items using a greedy longest-match approach. I call the first WordPiece of such a split the &lt;strong&gt;head &lt;/strong&gt;and the other WordPieces the &lt;strong&gt;tails&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Tails are marked with a leading “##” by the WordPiece tokenizer. It follows that the vocabulary is divided into two distinct sets: WordPieces that can occur both in a &lt;strong&gt;single&lt;/strong&gt; or in a &lt;strong&gt;head&lt;/strong&gt; role on the one hand (no leading “##”), and &lt;strong&gt;tail&lt;/strong&gt; WordPieces on the other hand (with a leading “##”).&lt;/p&gt;
&lt;p&gt;&lt;img alt="Example sentence illustrating one-to-one tokenization and one-to-many tokenization. In this example, “Leicester” is what I call a &amp;quot;head&amp;quot; WordPiece, and “##shire” is a &amp;quot;tail&amp;quot; WordPiece." src="https://vamvas.ch/assets/bert-for-ner/tokenizer.png" width="500px"&gt;
&lt;em&gt;Example sentence illustrating one-to-one tokenization and one-to-many tokenization. In this example, “Leicester” is what I call a "head" WordPiece, and “##shire” is a "tail" WordPiece.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;img alt="Examples for vocabulary entries of bert-large-cased that can occur as heads or single token (left) and entries that can only occur as tails (right)" src="https://vamvas.ch/assets/bert-for-ner/vocabulary.png" width="300px"&gt;
&lt;em&gt;Examples for vocabulary entries of bert-large-cased that can occur as heads or single token (left) and entries that can only occur as tails (right).&lt;/em&gt;&lt;/p&gt;
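&lt;p&gt;With today's Hugging Face transformers package (the successor of the pytorch-pretrained-BERT library mentioned above), the two cases are easy to inspect; the exact splits may differ slightly between tokenizer versions:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Sketch: one-to-one vs. one-to-many WordPiece tokenization
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-large-cased")
for word in ["Leicester", "Leicestershire"]:
    print(word, tokenizer.tokenize(word))
# "Leicester" is in the vocabulary and stays a single piece;
# "Leicestershire" is split into a head piece and "##"-prefixed tail pieces.
&lt;/code&gt;&lt;/pre&gt;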
&lt;p&gt;WordPiece embeddings are only one part of the input to BERT. The full input is a sum of three kinds of embeddings, each with a size of 768 for BERT-Base (or 1024 for BERT-Large):&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;WordPiece embeddings&lt;/strong&gt;, which like the other embeddings are trained from scratch and stay trainable during the fine-tuning step.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Segment embeddings.&lt;/strong&gt; These kinds of segments are only relevant for tasks where a pair of sentences is classified. For NER, the embedding is insignificant, and the same segment (“A”) can be used for all tokens.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Position embeddings&lt;/strong&gt;, which compensate for the non-recurrence of the transformer layers.&lt;/p&gt;
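&lt;p&gt;Conceptually, the input construction is just an element-wise sum of three lookup tables. The following sketch uses BERT-Base sizes and made-up ids, and it omits the layer normalization and dropout that the real implementation applies on top:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Sketch: summing WordPiece, segment and position embeddings
import torch
import torch.nn as nn

vocab_size, max_len, hidden = 30000, 512, 768   # roughly BERT-Base (English)
wordpiece_emb = nn.Embedding(vocab_size, hidden)
segment_emb = nn.Embedding(2, hidden)           # segments A and B
position_emb = nn.Embedding(max_len, hidden)

input_ids = torch.tensor([[101, 7592, 102]])    # [CLS] ... [SEP], illustrative ids
positions = torch.arange(input_ids.size(1)).unsqueeze(0)
segments = torch.zeros_like(input_ids)          # everything is segment A for NER

embeddings = wordpiece_emb(input_ids) + segment_emb(segments) + position_emb(positions)
print(embeddings.shape)   # torch.Size([1, 3, 768])
&lt;/code&gt;&lt;/pre&gt;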
&lt;p&gt;During pre-training, the input has a maximum length of 512 WordPieces. When BERT is finetuned on NER, using such long sequences is unnecessary, given that NER is usually done sentence by sentence. A sequence length of 150 should be enough, as the sentences in the English CoNLL-2003 validation set have 14.5 (original) tokens on average, and the longest sentence has 109 tokens. An alternative to having long enough sequences is a sliding window approach as described by &lt;a class='citation-link' data-toggle='tooltip' data-html='true' title='Shijie Wu and Mark Dredze.
Beto, bentz, becas: the surprising cross-lingual effectiveness of bert.
In &lt;em&gt;Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)&lt;/em&gt;, 833–844. 2019.' href='#wu2019beto' id='ref-wu2019beto-1'&gt;Wu and Dredze (2019)&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;BERT also expects that each sentence starts with a [CLS] token and ends with a [SEP] token. These special tokens are not particularly relevant for the NER task, considering that classification is done token-wise and the special tokens have no associated tag. Nevertheless, they should be included so that the fine-tuning input is not too different from the pre-training input.&lt;/p&gt;
&lt;p&gt;For most of the tasks BERT was evaluated on, a model with a lowercase vocabulary was used. For NER, however, the cased variant should be used and no lowercasing should be performed during preprocessing.&lt;/p&gt;
&lt;h2&gt;Architecture&lt;/h2&gt;
&lt;p&gt;BERT is a stack of &lt;strong&gt;Transformer&lt;/strong&gt; layers &lt;a class='citation-link' data-toggle='tooltip' data-html='true' title='Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan&amp;nbsp;N Gomez, &lt;span class="bibtex-protected"&gt;Ł&lt;/span&gt;ukasz Kaiser, and Illia Polosukhin.
Attention is all you need.
In &lt;em&gt;Advances in neural information processing systems&lt;/em&gt;, 5998–6008. 2017.' href='#vaswani2017attention' id='ref-vaswani2017attention-1'&gt;(Vaswani et al., 2017)&lt;/a&gt;. The variant of size “Base” has 12 layers, 12 self-attention heads, and a hidden state size of 768 per token. In total, BERT-Base has 110 million trainable parameters. The BERT-Large variant has 24 layers, 16 self-attention heads and a hidden size of 1024, which amounts to 340 million parameters.&lt;/p&gt;
&lt;p&gt;The Transformer implements some innovative ideas which are highly relevant for the NER task:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Self-attention.&lt;/strong&gt; Self-attention here is the idea of encoding a token as the weighted sum of its context. The weights are computed as a function of the context token (key) and the token to be encoded (query), and this function has trainable weights. Unlike RNN layers, which are also designed to represent tokens in context, the self-attention layer handles the context as a bag of words. This design decision is crucial: On the one hand, the parallelizability of the bag-of-words pattern makes it possible to efficiently pre-train BERT on very large corpora. On the other hand, steps need to be taken to incorporate word order in another way (see &lt;strong&gt;Soft Sequentiality&lt;/strong&gt; below).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Multi-head attention.&lt;/strong&gt; When a token is encoded as a weighted sum of its context, this is a rather coarse representation. For this reason, the designers of the Transformer introduced multi-head attention to increase the representational capacity of the model. Multi-head attention means that the self-attention step is performed multiple times in parallel, with different weight matrices. The outputs of the attention heads are then concatenated and projected back to the hidden size. In theory, this would allow an NER model to learn different attention heads for different classes. In reality, however, the classes are not separated that way, as preliminary investigations have shown.&lt;/p&gt;
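&lt;p&gt;The following sketch condenses the two previous paragraphs into code: scaled dot-product attention over the whole context, computed in several heads whose outputs are concatenated and projected back. It uses random weights and BERT-Base dimensions and leaves out masking, dropout and biases:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Sketch: multi-head self-attention (no trained weights, no padding mask)
import torch

def multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads=12):
    batch, seq_len, hidden = x.shape
    head_dim = hidden // num_heads
    def project(w):
        # Project the tokens and split the result into one chunk per head.
        return (x @ w).view(batch, seq_len, num_heads, head_dim).transpose(1, 2)
    q, k, v = project(w_q), project(w_k), project(w_v)
    # Every token attends to every other token; the context is a bag of words.
    weights = torch.softmax(q @ k.transpose(-2, -1) / head_dim ** 0.5, dim=-1)
    context = weights @ v                    # weighted sum of the context values
    context = context.transpose(1, 2).reshape(batch, seq_len, hidden)
    return context @ w_o                     # concatenate heads and project back

x = torch.randn(1, 5, 768)                   # 5 tokens with hidden size 768
w = [torch.randn(768, 768) for _ in range(4)]
print(multi_head_self_attention(x, *w).shape)   # torch.Size([1, 5, 768])
&lt;/code&gt;&lt;/pre&gt;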
&lt;p&gt;&lt;strong&gt;Stacking.&lt;/strong&gt; As is often done with RNNs, too, multiple self-attention layers are stacked. Interspersed with the self-attention layers are token-wise feed-forward layers, and all layers are connected via skip-connections, too. But interpreting differences in the various trained layers has turned out to be difficult. With respect to NER, &lt;a class='citation-link' data-toggle='tooltip' data-html='true' title='Shijie Wu and Mark Dredze.
Beto, bentz, becas: the surprising cross-lingual effectiveness of bert.
In &lt;em&gt;Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)&lt;/em&gt;, 833–844. 2019.' href='#wu2019beto' id='ref-wu2019beto-2'&gt;Wu and Dredze (2019)&lt;/a&gt; have shown that BERT layers do not consistently become more language-independent towards the final layer. Still, the layered architecture is hoped to improve the abstractive capacity of the model, which should help with a task such as NER that requires complex semantics.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Soft sequentiality.&lt;/strong&gt; This is a concept I like to use for how a transformer incorporates the sequentiality of the tokens. In an RNN, the sequence of tokens is hardwired through recurrence. On the other hand, in a transformer, the sequence is a feature of the tokens, which “softens” the sequentiality. The feature is chosen such that the model can generalize well to sequences of arbitrary length (a sinusoidal function of the position).&lt;/p&gt;
&lt;p&gt;&lt;img alt="It goes without saying that word order is crucial for NER, as these made-up headlines illustrate" src="https://vamvas.ch/assets/bert-for-ner/word-order.png" width="440px"&gt;
&lt;em&gt;It goes without saying that word order is crucial for NER, as these made-up headlines illustrate&lt;/em&gt;&lt;/p&gt;
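&lt;p&gt;For completeness, here is what the sinusoidal position features of the original Transformer look like in code. Note that the published BERT models learn their position embeddings instead, but the generalization argument is the same:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Sketch: sinusoidal position features as proposed for the Transformer
import numpy as np

def sinusoidal_positions(seq_len, dim):
    positions = np.arange(seq_len)[:, None]
    dims = np.arange(dim)[None, :]
    # Each pair of dimensions uses a sine/cosine wave of a different frequency.
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / dim)
    angles = positions * angle_rates
    return np.where(dims % 2 == 0, np.sin(angles), np.cos(angles))

print(sinusoidal_positions(seq_len=4, dim=8).round(2))
&lt;/code&gt;&lt;/pre&gt;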
&lt;h2&gt;Pre-Training&lt;/h2&gt;
&lt;p&gt;The idea behind pre-training is to initialize the model with a general language-modelling capacity that can later be transferred to a multitude of NLP tasks. While for most researchers and practitioners, pre-training BERT is rather expensive, domain-specific pre-training is known to further boost performance &lt;a class='citation-link' data-toggle='tooltip' data-html='true' title='Qizhe Xie, Zihang Dai, Eduard Hovy, Minh-Thang Luong, and Quoc&amp;nbsp;V Le.
Unsupervised data augmentation.
&lt;em&gt;arXiv preprint arXiv:1904.12848&lt;/em&gt;, 2019.' href='#xie2019unsupervised' id='ref-xie2019unsupervised-1'&gt;(Xie et al., 2019)&lt;/a&gt;. For completeness, I will summarize the pre-training step:&lt;/p&gt;
&lt;p&gt;Two prediction tasks are used which are entirely self-supervised.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Masked Language Modelling (MLM).&lt;/strong&gt; 15% of the tokens in a sentence are randomly selected. The selected tokens are “masked” in a randomized procedure; the other tokens are unchanged. BERT learns to predict the original word from the output hidden states corresponding to the masked words, using a softmax layer. The masking procedure is defined as follows: In 80% of the cases, the token is replaced with a special [MASK] token. In 10% of the cases, the token is replaced with a word chosen randomly from the vocabulary. In the remaining 10%, the token is left unchanged but still included in the loss.&lt;/p&gt;
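&lt;p&gt;A rough sketch of this masking procedure, ignoring special tokens, padding and whole-word constraints, could look like this. The ids are illustrative, and -100 is the usual "ignore" label for positions that do not contribute to the loss:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Sketch: the 15% selection and 80/10/10 masking of the MLM task
import random

MASK_ID = 103        # id of [MASK] in the cased English vocabulary
VOCAB_SIZE = 28996   # size of the cased English vocabulary
IGNORE = -100        # positions that are not predicted

def mask_for_mlm(token_ids, select_prob=0.15):
    inputs, labels = list(token_ids), [IGNORE] * len(token_ids)
    num_selected = max(1, round(select_prob * len(token_ids)))
    for i in random.sample(range(len(token_ids)), k=num_selected):
        labels[i] = token_ids[i]   # the original token has to be predicted
        action = random.choices(["mask", "random", "keep"], weights=[0.8, 0.1, 0.1])[0]
        if action == "mask":
            inputs[i] = MASK_ID
        elif action == "random":
            inputs[i] = random.randrange(VOCAB_SIZE)
        # "keep": the token stays unchanged but is still included in the loss
    return inputs, labels

print(mask_for_mlm([1109, 3676, 2068, 1113, 1103, 22607]))
&lt;/code&gt;&lt;/pre&gt;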
&lt;p&gt;&lt;strong&gt;Next Sentence Prediction. &lt;/strong&gt;The pre-training input consists of two segments, A and B. In 50% of the cases, the segments form a sequence in the original text. In the other cases, they have been randomly paired. From the last hidden state corresponding to the [CLS] token (sometimes called “pooled model output”), BERT learns to predict whether B is a “next sentence” to A or not.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://github.com/google-research/bert"&gt;originally published English models&lt;/a&gt; have been pre-trained on English Wikipedia, and books. From the two tasks, BERT has learnt representations of both words and sentences, which is a good starting point for fine-tuning.&lt;/p&gt;
&lt;h2&gt;Fine-Tuning for NER&lt;/h2&gt;
&lt;p&gt;&lt;img alt="Illustration of BERT for NER (Devlin et al. 2018)" src="https://vamvas.ch/assets/bert-for-ner/bert-for-ner.png" width="400px"&gt;
&lt;em&gt;Illustration of BERT for NER (Devlin et al. 2018)&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;When BERT is fine-tuned on a task, the pre-trained Transformer functions as an encoder, and a randomly initialized &lt;strong&gt;classifier&lt;/strong&gt; is now added on top. In the case of NER, the classifier is simply a projection from the token hidden state size to the size of the tag set, with a subsequent &lt;strong&gt;softmax&lt;/strong&gt; operation to turn the scores into likelihoods. The token classifier is shared across all positions. Considering that no RNN and no CRF layer is used, the NER classifier is simpler than the classifier proposed by &lt;a class='citation-link' data-toggle='tooltip' data-html='true' title='Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer.
Neural architectures for named entity recognition.
In &lt;em&gt;Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies&lt;/em&gt;, 260–270. 2016.' href='#lample2016neural' id='ref-lample2016neural-1'&gt;Lample et al. (2016)&lt;/a&gt;, which was used to achieve 93.5 F1 with the BERT “competitor” from Facebook AI Research &lt;a class='citation-link' data-toggle='tooltip' data-html='true' title='Alexei Baevski, Sergey Edunov, Yinhan Liu, Luke Zettlemoyer, and Michael Auli.
Cloze-driven pretraining of self-attention networks.
In &lt;em&gt;Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)&lt;/em&gt;, 5363–5372. 2019.' href='#baevski2019cloze' id='ref-baevski2019cloze-1'&gt;(Baevski et al., 2019)&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;To understand the classification step, it is useful to remember how the input has been preprocessed a few sections above. The one-to-many property of the WordPiece tokenizer ensures that for every token in the original dataset, there is at least one individual WordPiece that can be tagged. If an original token is split into multiple WordPieces (one-to-many case), then all WordPieces are tagged by the classifier, but only the head predictions are included in the loss and in the output at runtime. In other words, the head WordPiece serves as a proxy for the full original token. Thus, only the hidden states corresponding to &lt;strong&gt;single&lt;/strong&gt; or &lt;strong&gt;head&lt;/strong&gt; WordPieces are a relevant part of BERT’s &lt;strong&gt;last&lt;/strong&gt; layer, while for the lower levels, all hidden states remain relevant.&lt;/p&gt;
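&lt;p&gt;With the fast tokenizers in the current Hugging Face library, this head-only labeling is straightforward to implement via &lt;code&gt;word_ids()&lt;/code&gt;. The sentence and tag set below are made up, and -100 is again the conventional "ignore" index for the loss:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Sketch: assigning NER tags only to single or head WordPieces
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")
IGNORE = -100

tokens = ["United", "Nations", "meets", "in", "Leicestershire"]
tags = ["B-ORG", "I-ORG", "O", "O", "B-LOC"]   # would be mapped to ids in practice

encoding = tokenizer(tokens, is_split_into_words=True)
aligned, previous_word = [], None
for word_id in encoding.word_ids():
    if word_id is None:                  # [CLS] and [SEP] have no tag
        aligned.append(IGNORE)
    elif word_id != previous_word:       # single or head WordPiece: keep the tag
        aligned.append(tags[word_id])
    else:                                # tail WordPiece: excluded from the loss
        aligned.append(IGNORE)
    previous_word = word_id

print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))
print(aligned)
&lt;/code&gt;&lt;/pre&gt;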
&lt;p&gt;During fine-tuning, both the encoder and the classifier are trained with a small learning rate. Fine-tuning is usually done for a small number of epochs, e.g. 4, and with a two-phased optimization procedure &lt;a class='citation-link' data-toggle='tooltip' data-html='true' title='Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan&amp;nbsp;N Gomez, &lt;span class="bibtex-protected"&gt;Ł&lt;/span&gt;ukasz Kaiser, and Illia Polosukhin.
Attention is all you need.
In &lt;em&gt;Advances in neural information processing systems&lt;/em&gt;, 5998–6008. 2017.' href='#vaswani2017attention' id='ref-vaswani2017attention-2'&gt;(Vaswani et al., 2017)&lt;/a&gt;: &lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Warmup.&lt;/strong&gt; For a percentage of the fine-tuning steps (default: 0.1), the learning rate is increased from 0 (default: linearly).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Decay&lt;/strong&gt;. For remaining steps, the learning rate is decreased such that it is zero in the end (e.g. linearly).&lt;/p&gt;
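&lt;p&gt;Written out, the default schedule looks roughly like this; the warmup fraction follows the default mentioned above, and the peak learning rate of 3e-5 is just a typical value:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Sketch: linear warmup followed by linear decay of the learning rate
def learning_rate(step, total_steps, peak_lr=3e-5, warmup_fraction=0.1):
    warmup_steps = warmup_fraction * total_steps
    warmup = min(1.0, step / warmup_steps)                                 # rises from 0 to 1
    decay = min(1.0, (total_steps - step) / (total_steps - warmup_steps))  # then falls to 0
    return peak_lr * min(warmup, decay)

total = 1000
print([learning_rate(s, total) for s in (0, 50, 100, 500, 1000)])
&lt;/code&gt;&lt;/pre&gt;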
&lt;p&gt;In order to achieve the results they have published, the BERT authors have selected the best learning rate out of {5e-5, 3e-5, 2e-5} based on the validation performance.&lt;/p&gt;
&lt;p&gt;They have reported the following results for the English CoNLL-2003 test set:&lt;/p&gt;
&lt;table class="wp-block-table has-fixed-layout is-style-stripes"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Model size&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;&lt;strong&gt;English CoNLL-2003 Test F1&lt;/strong&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;BERT-Base&lt;/td&gt;&lt;td&gt;92.4&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;BERT-Large&lt;/td&gt;&lt;td&gt;92.8&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;

&lt;p&gt;As an alternative to fine-tuning, &lt;a class='citation-link' data-toggle='tooltip' data-html='true' title='Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova.
Bert: pre-training of deep bidirectional transformers for language understanding.
In &lt;em&gt;Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)&lt;/em&gt;, 4171–4186. 2019.' href='#devlin2019bert' id='ref-devlin2019bert-2'&gt;Devlin et al. (2019)&lt;/a&gt; report results on a &lt;strong&gt;feature-based approach&lt;/strong&gt;, too. In this case, a two-layer BiLSTM is put on top of the pre-trained BERT, and during training, the BERT parameters remain frozen. Comparing the validation set results of this approach to those of the fine-tuning approach, we can see that the more expensive fine-tuning brings an improvement for NER, albeit a small one:&lt;/p&gt;
&lt;table class="wp-block-table has-fixed-layout is-style-stripes"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;&lt;/td&gt;&lt;td&gt;&lt;strong&gt;English CoNLL-2003 Dev F1&lt;/strong&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Fine-tuning approach&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;BERT-Base&lt;/td&gt;&lt;td&gt;96.4&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;BERT-Large&lt;/td&gt;&lt;td&gt;96.6&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Feature-based approach&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;WordPiece Embeddings (first layer)&lt;/td&gt;&lt;td&gt;91.0&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Second-to-Last Hidden&lt;/td&gt;&lt;td&gt;95.6&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Last Hidden&lt;/td&gt;&lt;td&gt;94.9&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Weighted Sum Last Four Hidden&lt;/td&gt;&lt;td&gt;95.9&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Concat Last Four Hidden&lt;/td&gt;&lt;td&gt;96.1&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Weighted Sum All 12 Layers&lt;/td&gt;&lt;td&gt;95.5&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;

&lt;h2&gt;Cross-Lingual Transfer&lt;/h2&gt;
&lt;p&gt;Cross-lingual NER is a scenario where there are enough data for a &lt;strong&gt;source language&lt;/strong&gt; (usually English), and only little data for a &lt;strong&gt;target language&lt;/strong&gt;. For this post I will look at the most extreme case, where there are only evaluation data, and no training data at all, for the target language. This case is often called &lt;strong&gt;zero-shot transfer&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;The team behind BERT has open-sourced a multilingual BERT model (sometimes called &lt;strong&gt;mBERT&lt;/strong&gt;) that allows for experiments in this direction. Right now, the only documentation available is in a &lt;a href="https://github.com/google-research/bert/blob/master/multilingual.md"&gt;README&lt;/a&gt; on GitHub.&lt;/p&gt;
&lt;p&gt;mBERT has been trained on 104 Wikipedias in different languages. Generally, the corpora have been treated as if they were from a single language: There is no explicit denotation of the input language. This way, the BERT architecture did not have to be changed, and mBERT can now be used with any previously unseen language (e.g. Alemannic, which was not included as it has the 108th largest Wikipedia).&lt;/p&gt;
&lt;p&gt;For the vocabulary creation and the pre-training, relatively small languages were upsampled to a degree. I do not know whether the random replacing of words and the random pairing of sentences was done across languages or not – there would be arguments for and against doing so.&lt;/p&gt;
&lt;p&gt;A systematic study on the cross-lingual effectiveness of mBERT has been conducted by &lt;a class='citation-link' data-toggle='tooltip' data-html='true' title='Shijie Wu and Mark Dredze.
Beto, bentz, becas: the surprising cross-lingual effectiveness of bert.
In &lt;em&gt;Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)&lt;/em&gt;, 833–844. 2019.' href='#wu2019beto' id='ref-wu2019beto-3'&gt;Wu and Dredze (2019)&lt;/a&gt;. Their experiment was set up as follows:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Fine-tuning. &lt;/strong&gt;For NER, they first fine-tune bert-base-multilingual-cased on English CoNLL-2003, choosing a combination of learning rate, batch size and number of epochs that has the best performance on the English validation set.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Zero-shot evaluation.&lt;/strong&gt; Then they evaluate the model on the test set of another language (German, Spanish and Dutch). They want to find out whether mBERT can generalize to other languages without having seen target-language examples for the fine-tuning task.&lt;/p&gt;
&lt;p&gt;They report the following results for the CoNLL-2002/2003 languages:&lt;/p&gt;
&lt;table class="wp-block-table has-fixed-layout is-style-stripes"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;&lt;/td&gt;&lt;td&gt;&lt;strong&gt;EN&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;&lt;strong&gt;DE&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;&lt;strong&gt;NL&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;&lt;strong&gt;ES&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;&lt;strong&gt;ZH&lt;/strong&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Supervised&lt;/td&gt;&lt;td&gt;91.97&lt;/td&gt;&lt;td&gt;82.82&lt;/td&gt;&lt;td&gt;90.94&lt;/td&gt;&lt;td&gt;87.38&lt;/td&gt;&lt;td&gt;93.17&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Zero-shot SOTA (Xie et al. 2019)&lt;/td&gt;&lt;td&gt;&lt;/td&gt;&lt;td&gt;57.76&lt;/td&gt;&lt;td&gt;71.25&lt;/td&gt;&lt;td&gt;72.37&lt;/td&gt;&lt;td&gt;–&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Zero-Shot&lt;/td&gt;&lt;td&gt;–&lt;/td&gt;&lt;td&gt;69.56&lt;/td&gt;&lt;td&gt;77.57&lt;/td&gt;&lt;td&gt;74.96&lt;/td&gt;&lt;td&gt;51.90&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;

&lt;p&gt;Zero-shot transfer is of course not better than supervised learning, but mBERT clearly outperforms the previous zero-shot state of the art by &lt;a class='citation-link' data-toggle='tooltip' data-html='true' title='Jiateng Xie, Zhilin Yang, Graham Neubig, Noah&amp;nbsp;A Smith, and Jaime&amp;nbsp;G Carbonell.
Neural cross-lingual named entity recognition with minimal resources.
In &lt;em&gt;Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing&lt;/em&gt;, 369–379. 2018.' href='#xie2018neural' id='ref-xie2018neural-1'&gt;Xie et al. (2018)&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;a class='citation-link' data-toggle='tooltip' data-html='true' title='Telmo Pires, Eva Schlinger, and Dan Garrette.
How multilingual is multilingual bert?
In &lt;em&gt;Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics&lt;/em&gt;, 4996–5001. 2019.' href='#pires2019multilingual' id='ref-pires2019multilingual-1'&gt;Pires et al. (2019)&lt;/a&gt; performed the same experiment and even provide numbers for all source–target combinations of the CoNLL data:&lt;/p&gt;
&lt;table class="wp-block-table has-fixed-layout is-style-stripes"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;↓ Fine-tuning \ Eval →&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;&lt;strong&gt;EN&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;&lt;strong&gt;DE&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;&lt;strong&gt;NL&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;&lt;strong&gt;ES&lt;/strong&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;EN&lt;/td&gt;&lt;td&gt;90.70&lt;/td&gt;&lt;td&gt;69.74&lt;/td&gt;&lt;td&gt;77.36&lt;/td&gt;&lt;td&gt;73.59&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;DE&lt;/td&gt;&lt;td&gt;73.83&lt;/td&gt;&lt;td&gt;82.00&lt;/td&gt;&lt;td&gt;76.25&lt;/td&gt;&lt;td&gt;70.03&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;NL&lt;/td&gt;&lt;td&gt;65.46&lt;/td&gt;&lt;td&gt;65.68&lt;/td&gt;&lt;td&gt;89.86&lt;/td&gt;&lt;td&gt;72.10&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;ES&lt;/td&gt;&lt;td&gt;65.38&lt;/td&gt;&lt;td&gt;59.40&lt;/td&gt;&lt;td&gt;64.39&lt;/td&gt;&lt;td&gt;87.18&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;

&lt;p&gt;It is especially interesting that zero-shot transfer from a Dutch source seems to work better for Spanish than for English or German.&lt;/p&gt;
&lt;p&gt;The slight differences between the two tables can be explained by implementation details and different hyperparameters, considering that &lt;a class='citation-link' data-toggle='tooltip' data-html='true' title='Telmo Pires, Eva Schlinger, and Dan Garrette.
How multilingual is multilingual bert?
In &lt;em&gt;Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics&lt;/em&gt;, 4996–5001. 2019.' href='#pires2019multilingual' id='ref-pires2019multilingual-2'&gt;Pires et al. (2019)&lt;/a&gt; did not do any hyperparameter tuning.&lt;/p&gt;
&lt;p&gt;To conclude, BERT is a good basis for achieving state-of-the-art results in Named Entity Recognition, and fine-tuning it for this task is relatively simple to implement. The multilingual BERT model even allows for straightforward cross-lingual transfer, but the results here still leave room for improvement.&lt;/p&gt;&lt;div id="bibliography" class="appendix"&gt;&lt;h2&gt;References&lt;/h2&gt;&lt;p id='baevski2019cloze'&gt;Alexei Baevski, Sergey Edunov, Yinhan Liu, Luke Zettlemoyer, and Michael Auli.
Cloze-driven pretraining of self-attention networks.
In &lt;em&gt;Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)&lt;/em&gt;, 5363–5372. 2019. &lt;a class="cite-backref" href="#ref-baevski2019cloze-1" title="Jump back to reference 1"&gt;↩&lt;/a&gt;&lt;/p&gt;
&lt;p id='devlin2019bert'&gt;Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova.
Bert: pre-training of deep bidirectional transformers for language understanding.
In &lt;em&gt;Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)&lt;/em&gt;, 4171–4186. 2019. &lt;a class="cite-backref" href="#ref-devlin2019bert-1" title="Jump back to reference 1"&gt;↩&lt;/a&gt;&lt;a class="cite-backref" href="#ref-devlin2019bert-1" title="Jump back to reference 1"&gt; &lt;sup&gt;1&lt;/sup&gt; &lt;/a&gt;&lt;a class="cite-backref" href="#ref-devlin2019bert-2" title="Jump back to reference 2"&gt;&lt;sup&gt;2&lt;/sup&gt; &lt;/a&gt;&lt;/p&gt;
&lt;p id='johnson2017google'&gt;Melvin Johnson, Mike Schuster, Quoc&amp;nbsp;V Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Vi&lt;span class="bibtex-protected"&gt;é&lt;/span&gt;gas, Martin Wattenberg, Greg Corrado, and others.
Google’s multilingual neural machine translation system: enabling zero-shot translation.
&lt;em&gt;Transactions of the Association for Computational Linguistics&lt;/em&gt;, 5:339–351, 2017. &lt;a class="cite-backref" href="#ref-johnson2017google-1" title="Jump back to reference 1"&gt;↩&lt;/a&gt;&lt;/p&gt;
&lt;p id='lample2016neural'&gt;Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer.
Neural architectures for named entity recognition.
In &lt;em&gt;Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies&lt;/em&gt;, 260–270. 2016. &lt;a class="cite-backref" href="#ref-lample2016neural-1" title="Jump back to reference 1"&gt;↩&lt;/a&gt;&lt;/p&gt;
&lt;p id='pires2019multilingual'&gt;Telmo Pires, Eva Schlinger, and Dan Garrette.
How multilingual is multilingual bert?
In &lt;em&gt;Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics&lt;/em&gt;, 4996–5001. 2019. &lt;a class="cite-backref" href="#ref-pires2019multilingual-1" title="Jump back to reference 1"&gt;↩&lt;/a&gt;&lt;a class="cite-backref" href="#ref-pires2019multilingual-1" title="Jump back to reference 1"&gt; &lt;sup&gt;1&lt;/sup&gt; &lt;/a&gt;&lt;a class="cite-backref" href="#ref-pires2019multilingual-2" title="Jump back to reference 2"&gt;&lt;sup&gt;2&lt;/sup&gt; &lt;/a&gt;&lt;/p&gt;
&lt;p id='sennrich2016neural'&gt;Rico Sennrich, Barry Haddow, and Alexandra Birch.
Neural machine translation of rare words with subword units.
In &lt;em&gt;Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)&lt;/em&gt;, 1715–1725. 2016. &lt;a class="cite-backref" href="#ref-sennrich2016neural-1" title="Jump back to reference 1"&gt;↩&lt;/a&gt;&lt;/p&gt;
&lt;p id='vaswani2017attention'&gt;Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan&amp;nbsp;N Gomez, &lt;span class="bibtex-protected"&gt;Ł&lt;/span&gt;ukasz Kaiser, and Illia Polosukhin.
Attention is all you need.
In &lt;em&gt;Advances in neural information processing systems&lt;/em&gt;, 5998–6008. 2017. &lt;a class="cite-backref" href="#ref-vaswani2017attention-1" title="Jump back to reference 1"&gt;↩&lt;/a&gt;&lt;a class="cite-backref" href="#ref-vaswani2017attention-1" title="Jump back to reference 1"&gt; &lt;sup&gt;1&lt;/sup&gt; &lt;/a&gt;&lt;a class="cite-backref" href="#ref-vaswani2017attention-2" title="Jump back to reference 2"&gt;&lt;sup&gt;2&lt;/sup&gt; &lt;/a&gt;&lt;/p&gt;
&lt;p id='wu2019beto'&gt;Shijie Wu and Mark Dredze.
Beto, bentz, becas: the surprising cross-lingual effectiveness of bert.
In &lt;em&gt;Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)&lt;/em&gt;, 833–844. 2019. &lt;a class="cite-backref" href="#ref-wu2019beto-1" title="Jump back to reference 1"&gt;↩&lt;/a&gt;&lt;a class="cite-backref" href="#ref-wu2019beto-1" title="Jump back to reference 1"&gt; &lt;sup&gt;1&lt;/sup&gt; &lt;/a&gt;&lt;a class="cite-backref" href="#ref-wu2019beto-2" title="Jump back to reference 2"&gt;&lt;sup&gt;2&lt;/sup&gt; &lt;/a&gt;&lt;a class="cite-backref" href="#ref-wu2019beto-3" title="Jump back to reference 3"&gt;&lt;sup&gt;3&lt;/sup&gt; &lt;/a&gt;&lt;/p&gt;
&lt;p id='xie2018neural'&gt;Jiateng Xie, Zhilin Yang, Graham Neubig, Noah&amp;nbsp;A Smith, and Jaime&amp;nbsp;G Carbonell.
Neural cross-lingual named entity recognition with minimal resources.
In &lt;em&gt;Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing&lt;/em&gt;, 369–379. 2018. &lt;a class="cite-backref" href="#ref-xie2018neural-1" title="Jump back to reference 1"&gt;↩&lt;/a&gt;&lt;/p&gt;
&lt;p id='xie2019unsupervised'&gt;Qizhe Xie, Zihang Dai, Eduard Hovy, Minh-Thang Luong, and Quoc&amp;nbsp;V Le.
Unsupervised data augmentation.
&lt;em&gt;arXiv preprint arXiv:1904.12848&lt;/em&gt;, 2019. &lt;a class="cite-backref" href="#ref-xie2019unsupervised-1" title="Jump back to reference 1"&gt;↩&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;</content><category term="blog"></category></entry><entry><title>How NLP affects Gender Equality</title><link href="https://vamvas.ch/how-nlp-affects-gender-equality" rel="alternate"></link><published>2019-04-16T00:00:00+02:00</published><updated>2019-04-16T00:00:00+02:00</updated><author><name>Jannis Vamvas</name></author><id>tag:vamvas.ch,2019-04-16:/how-nlp-affects-gender-equality</id><summary type="html">&lt;p&gt;A brief essay discussing problems for gender equality in my field of study.&lt;/p&gt;</summary><content type="html">&lt;p&gt;More and more applications of &lt;em&gt;Natural Language Processing (NLP)&lt;/em&gt; are used in everyday life: When you translate a paragraph online, you are using an application of NLP, and also when you dictate a letter to a speech recognition system or when you ask questions to a voice assistant. The business world, too, is full of hidden but powerful applications of NLP.&lt;/p&gt;
&lt;p&gt;Given this increasing applicability, researchers need to be aware of ethical concepts such as &lt;em&gt;gender equality&lt;/em&gt;. First of all, NLP makes use of gender as an &lt;em&gt;explicit variable&lt;/em&gt; in many places. For example, state-of-the-art systems can infer the gender of a writer with a certain degree of accuracy, based on stylistic features of the text (see &lt;a class='citation-link' data-toggle='tooltip' data-html='true' title='Dong Nguyen, A&amp;nbsp;Seza Doğru&lt;span class="bibtex-protected"&gt;&lt;span class="bibtex-protected"&gt;ö&lt;/span&gt;&lt;/span&gt;z, Carolyn&amp;nbsp;P Ros&lt;span class="bibtex-protected"&gt;&lt;span class="bibtex-protected"&gt;é&lt;/span&gt;&lt;/span&gt;, and Franciska de&amp;nbsp;Jong.
&lt;span class="bibtex-protected"&gt;Computational sociolinguistics: A survey&lt;/span&gt;.
&lt;em&gt;Computational linguistics&lt;/em&gt;, 42(3):537–593, 2016.' href='#Nguyen2016' id='ref-Nguyen2016-1'&gt;Nguyen et al. (2016)&lt;/a&gt; for a critical survey). In an extreme case, this technology could allow employers to circumvent anonymization of job applications.&lt;/p&gt;
&lt;p&gt;Secondly, gender is ubiquitous in NLP systems as a &lt;em&gt;latent variable&lt;/em&gt;. To give an extreme example, think of a system that evaluates the quality of application letters. Even if the system seems harmless from a user perspective, it still might implicitly consider the applicants’ gender based on stylometric hints. If the company’s employment history has been biased towards men in the past, the system might wrongly infer that female applicants are less qualified in general. Another term for this phenomenon is &lt;em&gt;proxy discrimination&lt;/em&gt;, as recently discussed in &lt;a href="https://nyti.ms/2VBlSn3"&gt;a NYT op-ed article&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;Learning to discriminate&lt;/h2&gt;
&lt;p&gt;Let us have a closer look at gender as a latent variable. How can such a variable come into play? In NLP, most training data are derived from speech that humans have uttered somewhere in the past. Traditionally, the data are curated and annotated by experts.&lt;/p&gt;
&lt;p&gt;But researchers like to experiment with unsupervised approaches, feeding NLP systems gigantic amounts of raw textual data: books, social media posts, newspapers, or Wikipedia articles. Digesting those billions of utterances, systems pick up information about the morphology, syntax and semantics of a language.&lt;/p&gt;
&lt;p&gt;It has been shown that this way, the systems also acquire a concept of gender – and one that is fraught with stereotypes. For example, many systems display the same &lt;em&gt;gender biases&lt;/em&gt; that have been measured in humans: They associate words denoting women (“woman”, “girl”) less with math or science, and more with the arts, than words denoting men &lt;a class='citation-link' data-toggle='tooltip' data-html='true' title='Aylin Caliskan, Joanna&amp;nbsp;J Bryson, and Arvind Narayanan.
&lt;span class="bibtex-protected"&gt;Semantics derived automatically from language corpora contain human-like biases&lt;/span&gt;.
&lt;em&gt;Science&lt;/em&gt;, 356(6334):183–186, 2017.' href='#Caliskan2017' id='ref-Caliskan2017-1'&gt;(Caliskan et al., 2017)&lt;/a&gt;.&lt;/p&gt;
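&lt;p&gt;Such associations are typically measured with differences of cosine similarities between word vectors. The following toy sketch (my own illustration, with random vectors standing in for real embeddings) shows the kind of association score that is computed:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import numpy as np

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def association(word, attr_a, attr_b, emb):
    # Mean similarity to attribute set A minus mean similarity to attribute set B.
    return (np.mean([cos(emb[word], emb[a]) for a in attr_a])
            - np.mean([cos(emb[word], emb[b]) for b in attr_b]))

# Toy random embeddings for illustration only; real studies use trained embeddings.
rng = np.random.default_rng(0)
emb = {w: rng.normal(size=50)
       for w in ["woman", "man", "science", "math", "art", "poetry"]}
science, arts = ["science", "math"], ["art", "poetry"]
print(association("woman", science, arts, emb),
      association("man", science, arts, emb))
&lt;/code&gt;&lt;/pre&gt;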
&lt;p&gt;&lt;img alt="Correlation between association of word embeddings and actual employment numbers" src="https://vamvas.ch/assets/gender-equality/occupation-gender.png"&gt;
&lt;em&gt;Correlation between the association of word embeddings and actual employment numbers, as shown by &lt;a class='citation-link' data-toggle='tooltip' data-html='true' title='Aylin Caliskan, Joanna&amp;nbsp;J Bryson, and Arvind Narayanan.
&lt;span class="bibtex-protected"&gt;Semantics derived automatically from language corpora contain human-like biases&lt;/span&gt;.
&lt;em&gt;Science&lt;/em&gt;, 356(6334):183–186, 2017.' href='#Caliskan2017' id='ref-Caliskan2017-2'&gt;Caliskan et al. (2017)&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;One could argue that this kind of latent gender bias is not an issue of NLP itself. On this account, the systems reflect society as it is, with all its faults. For example, the association of certain occupations with the feminine gender could be seen as an inherent property of natural language.&lt;/p&gt;
&lt;p&gt;In the same way of thinking, a latent gender bias could be attributed to the annotators of the data, who are known to impress their own biases on the training data. To give an example, annotators asked to caption images of people tend to overspecify the gender: They would label a snowboarder as a man even if the face is not visible, and in consequence, image captioning systems learn to associate snowboards with men &lt;a class='citation-link' data-toggle='tooltip' data-html='true' title='Lisa&amp;nbsp;Anne Hendricks, Kaylee Burns, Kate Saenko, Trevor Darrell, and Anna Rohrbach.
&lt;span class="bibtex-protected"&gt;Women also snowboard: Overcoming bias in captioning models&lt;/span&gt;.
In &lt;em&gt;European Conference on Computer Vision&lt;/em&gt;, 793–811. 2018.' href='#Hendricks2018' id='ref-Hendricks2018-1'&gt;(Hendricks et al., 2018)&lt;/a&gt;. &lt;/p&gt;
&lt;p&gt;Finally, the bias could be attributed to the end-users and their preferences. For example, the fact that all major developers of voice assistants have chosen a female voice for their product is &lt;a href="https://www.npr.org/2018/07/09/627266501/the-push-for-a-gender-neutral-siri"&gt;usually justified with customer expectations&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Comparison of how different versions of a system caption the image of a snowboarder seen from behind (Hendricks et al. 2018)" src="https://vamvas.ch/assets/gender-equality/snowboarder.png"&gt;
&lt;em&gt;Comparison of how different versions of a system caption the image of a snowboarder seen from behind &lt;a class='citation-link' data-toggle='tooltip' data-html='true' title='Lisa&amp;nbsp;Anne Hendricks, Kaylee Burns, Kate Saenko, Trevor Darrell, and Anna Rohrbach.
&lt;span class="bibtex-protected"&gt;Women also snowboard: Overcoming bias in captioning models&lt;/span&gt;.
In &lt;em&gt;European Conference on Computer Vision&lt;/em&gt;, 793–811. 2018.' href='#Hendricks2018' id='ref-Hendricks2018-2'&gt;(Hendricks et al., 2018)&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;But this argument overlooks that the elements of an NLP system have been intentionally and actively composed by people. In addition, bias does not always stem from outside, but can also &lt;em&gt;emerge&lt;/em&gt; from the system itself &lt;a class='citation-link' data-toggle='tooltip' data-html='true' title='Batya Friedman and Helen Nissenbaum.
&lt;span class="bibtex-protected"&gt;Bias in computer systems&lt;/span&gt;.
&lt;em&gt;ACM Transactions on Information Systems (TOIS)&lt;/em&gt;, 14(3):330–347, 1996.' href='#Friedman1996' id='ref-Friedman1996-1'&gt;(Friedman and Nissenbaum, 1996)&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;A military speech recognition system may be developed with mostly male soldiers in mind, which in itself many people would find ethically acceptable. But if the system were later marketed to the general public without modifications and performed worse for female voices, a new kind of gender bias would have emerged.&lt;/p&gt;
&lt;p&gt;This scenario is not far-fetched, as &lt;a href="https://catalog.ldc.upenn.edu/LDC93S1"&gt;one of the most popular English speech corpora&lt;/a&gt; was commissioned by the U.S. military 30 years ago. In fact, the imbalance of this dataset may partly explain why the YouTube captioning system works better for male voices, or at least did so two years ago &lt;a class='citation-link' data-toggle='tooltip' data-html='true' title='Rachael Tatman.
&lt;span class="bibtex-protected"&gt;Gender and Dialect Bias in YouTube&amp;#39;s Automatic Captions&lt;/span&gt;.
In &lt;em&gt;Proceedings of the First ACL Workshop on Ethics in Natural Language Processing&lt;/em&gt;, 53–59. 2017.' href='#Tatman2017' id='ref-Tatman2017-1'&gt;(Tatman, 2017)&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;An approach to avoid accidental bias in NLP systems has been proposed by &lt;a class='citation-link' data-toggle='tooltip' data-html='true' title='Emily&amp;nbsp;M Bender and Batya Friedman.
Data statements for natural language processing: toward mitigating system bias and enabling better science.
&lt;em&gt;Transactions of the Association for Computational Linguistics&lt;/em&gt;, 6:587–604, 2018.' href='#Bender2018' id='ref-Bender2018-1'&gt;Bender and Friedman (2018)&lt;/a&gt;. In their view, a standard called &lt;em&gt;data statements&lt;/em&gt; would require the creators of a new dataset to describe the circumstances of its creation, and the users of the dataset to reiterate this statement in a brief paragraph every time they use it. While Bender and Friedman do not assume that all bias can be removed from systems through a proper declaration of data, they hope that the research could be better contextualized.&lt;/p&gt;
&lt;p&gt;Apart from preventive approaches, many technical solutions for &lt;em&gt;de-biasing&lt;/em&gt; a system post hoc have been proposed &lt;a class='citation-link' data-toggle='tooltip' data-html='true' title='Tolga Bolukbasi, Kai-Wei Chang, James&amp;nbsp;Y Zou, Venkatesh Saligrama, and Adam&amp;nbsp;Tauman Kalai.
&lt;span class="bibtex-protected"&gt;Quantifying and Reducing Stereotypes in Word Embeddings&lt;/span&gt;.
&lt;em&gt;CoRR&lt;/em&gt;, 2016.
URL: &lt;a href="http://arxiv.org/abs/1606.06121"&gt;http://arxiv.org/abs/1606.06121&lt;/a&gt;, &lt;a href="https://arxiv.org/abs/1606.06121"&gt;arXiv:1606.06121&lt;/a&gt;.' href='#Bolukbasi2016' id='ref-Bolukbasi2016-1'&gt;(Bolukbasi et al., 2016)&lt;/a&gt;. Furthermore, new forms of testing have been proposed to measure the gender bias of NLP systems &lt;a class='citation-link' data-toggle='tooltip' data-html='true' title='Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang.
&lt;span class="bibtex-protected"&gt;Men Also Like Shopping: Reducing Gender Bias Amplification using Corpus-level Constraints&lt;/span&gt;.
In &lt;em&gt;Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing&lt;/em&gt;, 2979–2989. 2017.' href='#Zhao2017' id='ref-Zhao2017-1'&gt;(Zhao et al., 2017)&lt;/a&gt;. Those solutions are often tailored to a specific scenario and do not address the problem at a systemic level &lt;a class='citation-link' data-toggle='tooltip' data-html='true' title='Hila Gonen and Yoav Goldberg.
Lipstick on a pig: debiasing methods cover up systematic gender biases in word embeddings but do not remove them.
In &lt;em&gt;Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)&lt;/em&gt;, 609–614. 2019.' href='#Gonen2019' id='ref-Gonen2019-1'&gt;(Gonen and Goldberg, 2019)&lt;/a&gt;. But they show that researchers have become aware of latent bias and that there is a discussion on how to avoid it.&lt;/p&gt;
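&lt;p&gt;To give an impression of what such de-biasing can look like, here is a heavily simplified numpy sketch in the spirit of Bolukbasi et al.: estimate a gender direction from a few definitional word pairs and remove a word vector’s component along that direction (the toy vectors and helper names are my own, not the authors’):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import numpy as np

def gender_direction(emb, pairs=(("she", "he"), ("woman", "man"))):
    # Average the normalized difference vectors of a few definitional pairs
    # (the original method uses PCA over such differences).
    diffs = [emb[a] - emb[b] for a, b in pairs]
    d = np.mean(diffs, axis=0)
    return d / np.linalg.norm(d)

def neutralize(vec, direction):
    # Remove the component of the vector that lies along the gender direction.
    return vec - (vec @ direction) * direction

# Toy random vectors for illustration; real embeddings would come from word2vec or GloVe.
rng = np.random.default_rng(0)
emb = {w: rng.normal(size=50) for w in ["she", "he", "woman", "man", "nurse"]}
d = gender_direction(emb)
emb["nurse"] = neutralize(emb["nurse"], d)
print(float(emb["nurse"] @ d))  # approximately 0 after neutralization
&lt;/code&gt;&lt;/pre&gt;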
&lt;h2&gt;The three roles of gender in NLP&lt;/h2&gt;
&lt;p&gt;On the other hand, systems that make &lt;em&gt;explicit use of gender &lt;/em&gt;are problematized less frequently, even though the concept of gender has been challenged by Gender Studies and related fields for decades. For example, a recent interdisciplinary review of gender classification fails to mention the existence of constructivist or non-binary approaches to gender entirely &lt;a class='citation-link' data-toggle='tooltip' data-html='true' title='Feng Lin, Yingxiao Wu, Yan Zhuang, Xi&amp;nbsp;Long, and Wenyao Xu.
&lt;span class="bibtex-protected"&gt;Human gender classification: a review&lt;/span&gt;.
&lt;em&gt;International Journal of Biometrics&lt;/em&gt;, 8(3-4):275–300, 2016.' href='#Lin2016' id='ref-Lin2016-1'&gt;(Lin et al., 2016)&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;In a first set of &lt;em&gt;guidelines&lt;/em&gt; on the explicit use of gender in NLP, &lt;a class='citation-link' data-toggle='tooltip' data-html='true' title='Brian Larson.
&lt;span class="bibtex-protected"&gt;Gender as a Variable in Natural-Language Processing: Ethical Considerations&lt;/span&gt;.
In &lt;em&gt;Proceedings of the First ACL Workshop on Ethics in Natural Language Processing&lt;/em&gt;, 1–11. 2017.' href='#Larson2017' id='ref-Larson2017-1'&gt;Larson (2017)&lt;/a&gt; proposes the following: Researchers should always specify what concept of gender they employ, even if it just means quoting a classic definition. In addition, researchers should only utilize information on gender if it is necessary to achieve the research objective, and not just because the data are already available or because it is easier to ask the question “What is your gender?” than it is to ask other questions. Researchers should also specify what method they use to distinguish between genders in the annotation process.&lt;/p&gt;
&lt;p&gt;Larson’s guidelines are a balanced ethical foundation for future NLP research into gender. However, there is one aspect that goes unmentioned by Larson: How NLP technology could be misused by malicious people to discriminate on the basis of gender. Would it be prudent to stop NLP research altogether in order to prevent abuse?&lt;/p&gt;
&lt;p&gt;When discussing the consequences of technology, ethicists often employ the concept of &lt;em&gt;dual use &lt;/em&gt;&lt;a class='citation-link' data-toggle='tooltip' data-html='true' title='Dirk Hovy and Shannon&amp;nbsp;L Spruit.
&lt;span class="bibtex-protected"&gt;The social impact of natural language processing&lt;/span&gt;.
In &lt;em&gt;Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)&lt;/em&gt;, volume&amp;nbsp;2, 591–598. 2016.' href='#Hovy2016' id='ref-Hovy2016-1'&gt;(Hovy and Spruit, 2016)&lt;/a&gt;. To give an example, every system that can be used for the inference of gender from language (&lt;em&gt;gender profiling&lt;/em&gt;) can also be used to rewrite the text such as to prevent this inference (&lt;em&gt;gender obfuscation&lt;/em&gt;; &lt;a class='citation-link' data-toggle='tooltip' data-html='true' title='Sravana Reddy and Kevin Knight.
&lt;span class="bibtex-protected"&gt;Obfuscating gender in social media writing&lt;/span&gt;.
In &lt;em&gt;Proceedings of the First Workshop on NLP and Computational Social Science&lt;/em&gt;, 17–26. 2016.' href='#Reddy2016' id='ref-Reddy2016-1'&gt;Reddy and Knight (2016)&lt;/a&gt;). Another, reverse, example is a system that can detect misogynist tweets but could also be misused to automatically generate misogynist speech.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Style transfer experiment that could also be used for obfuscation of gender (Lample et. al 2019)" src="https://vamvas.ch/assets/gender-equality/style-transfer.png"&gt;
&lt;em&gt;Style transfer experiment that could also be used for obfuscation of gender &lt;a class='citation-link' data-toggle='tooltip' data-html='true' title='Guillaume Lample, Sandeep Subramanian, Eric Smith, Ludovic Denoyer, Marc&amp;#39;Aurelio Ranzato, and Y-Lan Boureau.
Multiple-attribute text rewriting.
In &lt;em&gt;International Conference on Learning Representations&lt;/em&gt;. 2019.
URL: &lt;a href="https://openreview.net/forum?id=H1g2NhC5KQ"&gt;https://openreview.net/forum?id=H1g2NhC5KQ&lt;/a&gt;.' href='#lample2018multipleattribute' id='ref-lample2018multipleattribute-1'&gt;(Lample et al., 2019)&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;I believe that due to its specific nature, NLP has a &lt;em&gt;third use&lt;/em&gt; in addition to this dual use: Because they can analyze data on a large scale, NLP systems can inform a critical public of preexistent bias that manifests in natural-language texts. There are many studies where this third, &lt;em&gt;sociolinguistic-diagnostic use&lt;/em&gt; is applied, from the analysis of letters of recommendation for male/female job applicants &lt;a class='citation-link' data-toggle='tooltip' data-html='true' title='Toni Schmader, Jessica Whitehead, and Vicki&amp;nbsp;H Wysocki.
&lt;span class="bibtex-protected"&gt;A Linguistic Comparison of Letters of Recommendation for Male and Female Chemistry and Biochemistry Job Applicants&lt;/span&gt;.
&lt;em&gt;Sex Roles&lt;/em&gt;, 57:509–514, 2007.' href='#Schmader2007' id='ref-Schmader2007-1'&gt;(Schmader et al., 2007)&lt;/a&gt; to the analysis of questions that sports journalists pose to male/female tennis players &lt;a class='citation-link' data-toggle='tooltip' data-html='true' title='Liye Fu, Cristian Danescu-Niculescu-Mizil, and Lillian Lee.
&lt;span class="bibtex-protected"&gt;Tie-breaker: Using language models to quantify gender bias in sports journalism&lt;/span&gt;.
In &lt;em&gt;Proceedings of the IJCAI workshop on NLP meets Journalism&lt;/em&gt;. 2016.' href='#Fu2016' id='ref-Fu2016-1'&gt;(Fu et al., 2016)&lt;/a&gt;. In another example, &lt;a class='citation-link' data-toggle='tooltip' data-html='true' title='Ethan Fast, Tina Vachovsky, and Michael&amp;nbsp;S Bernstein.
&lt;span class="bibtex-protected"&gt;Shirtless and dangerous: Quantifying linguistic signals of gender bias in an online fiction writing community&lt;/span&gt;.
In &lt;em&gt;Tenth International AAAI Conference on Web and Social Media&lt;/em&gt;. 2016.' href='#Fast2016' id='ref-Fast2016-1'&gt;Fast et al. (2016)&lt;/a&gt; analyze amateur fiction using NLP and find an abundance of gender stereotypes in every genre, irrespective of the author’s gender. As a final example, &lt;a class='citation-link' data-toggle='tooltip' data-html='true' title='Nikhil Garg, Londa Schiebinger, Dan Jurafsky, and James Zou.
&lt;span class="bibtex-protected"&gt;Word embeddings quantify 100 years of gender and ethnic stereotypes&lt;/span&gt;.
&lt;em&gt;Proceedings of the National Academy of Sciences&lt;/em&gt;, 115(16):E3635–E3644, 2018.' href='#Garg2018' id='ref-Garg2018-1'&gt;Garg et al. (2018)&lt;/a&gt; quantify the historical development of stereotypes based on Google Books, and show a correlation with U.S. census data.&lt;/p&gt;
&lt;p&gt;This third use of NLP is clearly a chance to promote gender equality, even though some may criticise that all those studies assume a &lt;em&gt;binary view of gender&lt;/em&gt;. I think that according to the guidelines by &lt;a class='citation-link' data-toggle='tooltip' data-html='true' title='Brian Larson.
&lt;span class="bibtex-protected"&gt;Gender as a Variable in Natural-Language Processing: Ethical Considerations&lt;/span&gt;.
In &lt;em&gt;Proceedings of the First ACL Workshop on Ethics in Natural Language Processing&lt;/em&gt;, 1–11. 2017.' href='#Larson2017' id='ref-Larson2017-2'&gt;Larson (2017)&lt;/a&gt;, this simplification can be justified as there is a clear research objective: making discrimination visible.&lt;/p&gt;
&lt;p&gt;In my view, NLP offers chances as well as threats to gender equality, and the threats can have various sources: Preexistent societal bias, emergent bias, or the uncritical use of gender as an explicit variable. Given the promise that NLP holds for diagnosing discriminatory use of language, there is hope that the opportunities will eventually outweigh the threats.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;This is an abridged version of an essay written as part of the &lt;a href="http://web.archive.org/web/20190415172302/https://www.frauenbeauftragte.uni-muenchen.de/weiterbildung/plus/genderzertifikat/index.html"&gt;certificate program “Gender and Diversity Competence” at LMU Munich&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;&lt;div id="bibliography" class="appendix"&gt;&lt;h2&gt;References&lt;/h2&gt;&lt;p id='Bender2018'&gt;Emily&amp;nbsp;M Bender and Batya Friedman.
Data statements for natural language processing: toward mitigating system bias and enabling better science.
&lt;em&gt;Transactions of the Association for Computational Linguistics&lt;/em&gt;, 6:587–604, 2018. &lt;a class="cite-backref" href="#ref-Bender2018-1" title="Jump back to reference 1"&gt;↩&lt;/a&gt;&lt;/p&gt;
&lt;p id='Bolukbasi2016'&gt;Tolga Bolukbasi, Kai-Wei Chang, James&amp;nbsp;Y Zou, Venkatesh Saligrama, and Adam&amp;nbsp;Tauman Kalai.
&lt;span class="bibtex-protected"&gt;Quantifying and Reducing Stereotypes in Word Embeddings&lt;/span&gt;.
&lt;em&gt;CoRR&lt;/em&gt;, 2016.
URL: &lt;a href="http://arxiv.org/abs/1606.06121"&gt;http://arxiv.org/abs/1606.06121&lt;/a&gt;, &lt;a href="https://arxiv.org/abs/1606.06121"&gt;arXiv:1606.06121&lt;/a&gt;. &lt;a class="cite-backref" href="#ref-Bolukbasi2016-1" title="Jump back to reference 1"&gt;↩&lt;/a&gt;&lt;/p&gt;
&lt;p id='Caliskan2017'&gt;Aylin Caliskan, Joanna&amp;nbsp;J Bryson, and Arvind Narayanan.
&lt;span class="bibtex-protected"&gt;Semantics derived automatically from language corpora contain human-like biases&lt;/span&gt;.
&lt;em&gt;Science&lt;/em&gt;, 356(6334):183–186, 2017. &lt;a class="cite-backref" href="#ref-Caliskan2017-1" title="Jump back to reference 1"&gt;↩&lt;/a&gt;&lt;a class="cite-backref" href="#ref-Caliskan2017-1" title="Jump back to reference 1"&gt; &lt;sup&gt;1&lt;/sup&gt; &lt;/a&gt;&lt;a class="cite-backref" href="#ref-Caliskan2017-2" title="Jump back to reference 2"&gt;&lt;sup&gt;2&lt;/sup&gt; &lt;/a&gt;&lt;/p&gt;
&lt;p id='Fast2016'&gt;Ethan Fast, Tina Vachovsky, and Michael&amp;nbsp;S Bernstein.
&lt;span class="bibtex-protected"&gt;Shirtless and dangerous: Quantifying linguistic signals of gender bias in an online fiction writing community&lt;/span&gt;.
In &lt;em&gt;Tenth International AAAI Conference on Web and Social Media&lt;/em&gt;. 2016. &lt;a class="cite-backref" href="#ref-Fast2016-1" title="Jump back to reference 1"&gt;↩&lt;/a&gt;&lt;/p&gt;
&lt;p id='Friedman1996'&gt;Batya Friedman and Helen Nissenbaum.
&lt;span class="bibtex-protected"&gt;Bias in computer systems&lt;/span&gt;.
&lt;em&gt;ACM Transactions on Information Systems (TOIS)&lt;/em&gt;, 14(3):330–347, 1996. &lt;a class="cite-backref" href="#ref-Friedman1996-1" title="Jump back to reference 1"&gt;↩&lt;/a&gt;&lt;/p&gt;
&lt;p id='Fu2016'&gt;Liye Fu, Cristian Danescu-Niculescu-Mizil, and Lillian Lee.
&lt;span class="bibtex-protected"&gt;Tie-breaker: Using language models to quantify gender bias in sports journalism&lt;/span&gt;.
In &lt;em&gt;Proceedings of the IJCAI workshop on NLP meets Journalism&lt;/em&gt;. 2016. &lt;a class="cite-backref" href="#ref-Fu2016-1" title="Jump back to reference 1"&gt;↩&lt;/a&gt;&lt;/p&gt;
&lt;p id='Garg2018'&gt;Nikhil Garg, Londa Schiebinger, Dan Jurafsky, and James Zou.
&lt;span class="bibtex-protected"&gt;Word embeddings quantify 100 years of gender and ethnic stereotypes&lt;/span&gt;.
&lt;em&gt;Proceedings of the National Academy of Sciences&lt;/em&gt;, 115(16):E3635–E3644, 2018. &lt;a class="cite-backref" href="#ref-Garg2018-1" title="Jump back to reference 1"&gt;↩&lt;/a&gt;&lt;/p&gt;
&lt;p id='Gonen2019'&gt;Hila Gonen and Yoav Goldberg.
Lipstick on a pig: debiasing methods cover up systematic gender biases in word embeddings but do not remove them.
In &lt;em&gt;Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)&lt;/em&gt;, 609–614. 2019. &lt;a class="cite-backref" href="#ref-Gonen2019-1" title="Jump back to reference 1"&gt;↩&lt;/a&gt;&lt;/p&gt;
&lt;p id='Hendricks2018'&gt;Lisa&amp;nbsp;Anne Hendricks, Kaylee Burns, Kate Saenko, Trevor Darrell, and Anna Rohrbach.
&lt;span class="bibtex-protected"&gt;Women also snowboard: Overcoming bias in captioning models&lt;/span&gt;.
In &lt;em&gt;European Conference on Computer Vision&lt;/em&gt;, 793–811. 2018. &lt;a class="cite-backref" href="#ref-Hendricks2018-1" title="Jump back to reference 1"&gt;↩&lt;/a&gt;&lt;a class="cite-backref" href="#ref-Hendricks2018-1" title="Jump back to reference 1"&gt; &lt;sup&gt;1&lt;/sup&gt; &lt;/a&gt;&lt;a class="cite-backref" href="#ref-Hendricks2018-2" title="Jump back to reference 2"&gt;&lt;sup&gt;2&lt;/sup&gt; &lt;/a&gt;&lt;/p&gt;
&lt;p id='Hovy2016'&gt;Dirk Hovy and Shannon&amp;nbsp;L Spruit.
&lt;span class="bibtex-protected"&gt;The social impact of natural language processing&lt;/span&gt;.
In &lt;em&gt;Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)&lt;/em&gt;, volume&amp;nbsp;2, 591–598. 2016. &lt;a class="cite-backref" href="#ref-Hovy2016-1" title="Jump back to reference 1"&gt;↩&lt;/a&gt;&lt;/p&gt;
&lt;p id='lample2018multipleattribute'&gt;Guillaume Lample, Sandeep Subramanian, Eric Smith, Ludovic Denoyer, Marc&amp;#39;Aurelio Ranzato, and Y-Lan Boureau.
Multiple-attribute text rewriting.
In &lt;em&gt;International Conference on Learning Representations&lt;/em&gt;. 2019.
URL: &lt;a href="https://openreview.net/forum?id=H1g2NhC5KQ"&gt;https://openreview.net/forum?id=H1g2NhC5KQ&lt;/a&gt;. &lt;a class="cite-backref" href="#ref-lample2018multipleattribute-1" title="Jump back to reference 1"&gt;↩&lt;/a&gt;&lt;/p&gt;
&lt;p id='Larson2017'&gt;Brian Larson.
&lt;span class="bibtex-protected"&gt;Gender as a Variable in Natural-Language Processing: Ethical Considerations&lt;/span&gt;.
In &lt;em&gt;Proceedings of the First ACL Workshop on Ethics in Natural Language Processing&lt;/em&gt;, 1–11. 2017. &lt;a class="cite-backref" href="#ref-Larson2017-1" title="Jump back to reference 1"&gt;↩&lt;/a&gt;&lt;a class="cite-backref" href="#ref-Larson2017-1" title="Jump back to reference 1"&gt; &lt;sup&gt;1&lt;/sup&gt; &lt;/a&gt;&lt;a class="cite-backref" href="#ref-Larson2017-2" title="Jump back to reference 2"&gt;&lt;sup&gt;2&lt;/sup&gt; &lt;/a&gt;&lt;/p&gt;
&lt;p id='Lin2016'&gt;Feng Lin, Yingxiao Wu, Yan Zhuang, Xi&amp;nbsp;Long, and Wenyao Xu.
&lt;span class="bibtex-protected"&gt;Human gender classification: a review&lt;/span&gt;.
&lt;em&gt;International Journal of Biometrics&lt;/em&gt;, 8(3-4):275–300, 2016. &lt;a class="cite-backref" href="#ref-Lin2016-1" title="Jump back to reference 1"&gt;↩&lt;/a&gt;&lt;/p&gt;
&lt;p id='Nguyen2016'&gt;Dong Nguyen, A&amp;nbsp;Seza Doğru&lt;span class="bibtex-protected"&gt;&lt;span class="bibtex-protected"&gt;ö&lt;/span&gt;&lt;/span&gt;z, Carolyn&amp;nbsp;P Ros&lt;span class="bibtex-protected"&gt;&lt;span class="bibtex-protected"&gt;é&lt;/span&gt;&lt;/span&gt;, and Franciska de&amp;nbsp;Jong.
&lt;span class="bibtex-protected"&gt;Computational sociolinguistics: A survey&lt;/span&gt;.
&lt;em&gt;Computational linguistics&lt;/em&gt;, 42(3):537–593, 2016. &lt;a class="cite-backref" href="#ref-Nguyen2016-1" title="Jump back to reference 1"&gt;↩&lt;/a&gt;&lt;/p&gt;
&lt;p id='Reddy2016'&gt;Sravana Reddy and Kevin Knight.
&lt;span class="bibtex-protected"&gt;Obfuscating gender in social media writing&lt;/span&gt;.
In &lt;em&gt;Proceedings of the First Workshop on NLP and Computational Social Science&lt;/em&gt;, 17–26. 2016. &lt;a class="cite-backref" href="#ref-Reddy2016-1" title="Jump back to reference 1"&gt;↩&lt;/a&gt;&lt;/p&gt;
&lt;p id='Schmader2007'&gt;Toni Schmader, Jessica Whitehead, and Vicki&amp;nbsp;H Wysocki.
&lt;span class="bibtex-protected"&gt;A Linguistic Comparison of Letters of Recommendation for Male and Female Chemistry and Biochemistry Job Applicants&lt;/span&gt;.
&lt;em&gt;Sex Roles&lt;/em&gt;, 57:509–514, 2007. &lt;a class="cite-backref" href="#ref-Schmader2007-1" title="Jump back to reference 1"&gt;↩&lt;/a&gt;&lt;/p&gt;
&lt;p id='Tatman2017'&gt;Rachael Tatman.
&lt;span class="bibtex-protected"&gt;Gender and Dialect Bias in YouTube&amp;#39;s Automatic Captions&lt;/span&gt;.
In &lt;em&gt;Proceedings of the First ACL Workshop on Ethics in Natural Language Processing&lt;/em&gt;, 53–59. 2017. &lt;a class="cite-backref" href="#ref-Tatman2017-1" title="Jump back to reference 1"&gt;↩&lt;/a&gt;&lt;/p&gt;
&lt;p id='Zhao2017'&gt;Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang.
&lt;span class="bibtex-protected"&gt;Men Also Like Shopping: Reducing Gender Bias Amplification using Corpus-level Constraints&lt;/span&gt;.
In &lt;em&gt;Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing&lt;/em&gt;, 2979–2989. 2017. &lt;a class="cite-backref" href="#ref-Zhao2017-1" title="Jump back to reference 1"&gt;↩&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;</content><category term="blog"></category></entry></feed>