CSCI 699: Robustness and Generalization in Natural Language Processing

Note: As per University policy, class will be held remotely for the first two weeks of the semester. Please use the Zoom link on Blackboard.

In natural language processing (NLP), we set out to solve language-related tasks (e.g., machine translation, question answering) but often evaluate on narrow, in-distribution test datasets. With recent advances in deep learning, modern systems have achieved high accuracy on many canonical datasets, but still seem far from solving general tasks. In this class, we will survey recent research on robustness and generalization that studies this gap between in-distribution accuracy and task competency through out-of-distribution settings. We will learn about different settings in which NLP systems often fail to generalize well, including adversarial perturbations, settings that require compositional reasoning, and domain transfer. We will also learn about how average accuracy can mask disparate performance across subpopulations, and how this can lead to undesirable consequences. Across these topics, we will cover both methods for measuring these robustness and generalization issues and ways to improve model robustness and generalization.

Logistics

Office hours: Tuesdays 4-5pm on Zoom (link on Blackboard/Slack), or by appointment.
Assignments: Submit assignments on Blackboard. Feedback will also be provided on Blackboard.
Discussion: Please use the official course Slack channel for general questions. Email me (please put “CSCI 699” in the subject line) or come to office hours to discuss individual matters, such as project ideas or grading.

Prerequisites

Familiarity with natural language processing and/or machine learning at the level of CSCI 544 (Applied Natural Language Processing) or CSCI 567 (Machine Learning). Please email me if you want to enroll but are unsure whether you meet the prerequisites.

For those without prior NLP experience, I recommend going through Lena Voita’s NLP Course For You, which provides a concise and interactive introduction to modern NLP. For a more extensive introduction to NLP, I recommend Jurafsky and Martin’s Speech and Language Processing, whose third edition is available online and is very current.

Schedule

Date	Topic	Reading(s)	Additional reading(s)	Assignments
Mon Jan 10	Introduction
Wed Jan 12	The Turing Test: Lecture	Turing 1950, Shieber 2016	Shieber et al. 2004
Mon Jan 17	No class (Martin Luther King Day)
Wed Jan 19	Adversarial examples I: Lecture		Goodfellow et al. 2014, Adversarial ML Tutorial
Mon Jan 24	Adversarial examples II: Adversarial perturbations	Pruthi et al. 2019, Jones et al. 2020	Ribeiro et al. 2018, Jia et al. 2019, Huang et al. 2019
Wed Jan 26	Adversarial examples III: Adversarial triggers	Wallace et al. 2019, Atanasova et al. 2020
Mon Jan 31	Adversarial examples IV: Model stealing, data poisoning	Krishna et al. 2020, Wallace et al. 2021	Wallace et al. 2020
Wed Feb 2	Domain adaptation I: Lecture		Ramponi and Plank 2020
Mon Feb 7	Domain adaptation II: Unsupervised domain adaptation and pretraining	Blitzer et al. 2006, Han and Eisenstein 2019	Gururangan et al. 2020
Wed Feb 9	Domain adaptation III: Fair generalization tasks, empirical trends	Geiger et al. 2019, Miller et al. 2020	Fisch et al. 2019, Taori et al. 2021	Project proposal due Feb 11
Mon Feb 14	Spurious correlations I: Lecture		Imbens and Rubin 2015, Imbens 2020, Feder et al. 2021
Wed Feb 16	Spurious correlations II: Dataset biases	Schwartz et al. 2017, Gururangan et al. 2018, Gardner et al. 2021	Poliak et al. 2018, Kaushik et al. 2018, Schuster et al. 2019, Ribeiro et al. 2020
Mon Feb 21	No class (Presidents’ Day)
Wed Feb 23	Spurious correlations III: Training-time strategies	Clark et al. 2019, Utama et al. 2020	Clark et al. 2020, Tu et al. 2020
Mon Feb 28	Spurious correlations IV: Counterfactual data augmentation	Kaushik et al. 2019, Joshi and He 2021	Gardner et al. 2020, Ross et al. 2021, Sen et al. 2021
Wed Mar 2	Fairness I: Lecture		Barocas, Hardt, and Narayanan
Mon Mar 7	Fairness II: Gender and race bias in NLP systems	Zhao et al. 2018, Rudinger et al. 2018, Sap et al. 2019	Blodgett et al. 2020, Field et al. 2021
Wed Mar 9	Fairness III: Bias in representations	Goldfarb-Tarrant et al. 2021, Vig et al. 2020	Caliskan et al. 2017
Mon Mar 14	No class (Spring break)
Wed Mar 16	No class (Spring break)
Mon Mar 21	Fairness IV: Distributionally robust optimization	Hashimoto et al. 2018, Sagawa et al. 2020	Oren et al. 2019, Michel et al. 2021
Wed Mar 23	Fairness V: Bias amplification	Zhao et al. 2017, Jia et al. 2020	Wang et al. 2019	Project progress report due Mar 25
Mon Mar 28	Compositionality I: Lecture	Fodor and Pylyshyn 1988	Coppock and Champollion, Szabó 2008
Wed Mar 30	Compositionality II: Measuring compositional behavior	Hupkes et al. 2020	Lake and Baroni 2018, Kim and Linzen 2020, Dankers et al. 2021
Mon Apr 4	Compositionality III: Modeling choices	Herzig et al. 2021, Csordás et al. 2021	Chen et al. 2020, Shaw et al. 2021, Furrer et al. 2021
Wed Apr 6	Dataset creation I: Adversarial data collection	Kaushik et al. 2021, Wallace et al. 2021	Wallace et al. 2019, Kiela et al. 2021
Mon Apr 11	Dataset creation II: Adversarial filtering	Le Bras et al. 2020, Phang et al. 2021	Swayamdipta et al. 2020
Wed Apr 13	Conclusion, Bonus topics
Mon Apr 18	Project presentations
Wed Apr 20	Project presentations
Mon Apr 25	Project presentations
Wed Apr 27	Project presentations			Project final report due May 6

Format

Class days marked as Introduction, Conclusion, or Lecture will be presentations by me. Other classes will be paper presentations and discussions led by 1-2 students. The expected format of these classes is:

60 minutes: Presentation on all papers
25 minutes: Small group discussion
25 minutes: Whole class discussion

Grading

Grades will be based on paper presentations (30%), discussion (10%), and a final project (60% total).

Paper presentations (30%). Students will be expected to present ~2 research papers (sometimes 3 short ones or 1 long one) and lead class discussion on these papers. The presentation should help everyone in the class understand these papers as well as relevant background material. The presenter should also prepare a few discussion questions to encourage discussion after the presentation. To help presenters prepare their presentations, each presentation day will also have an assigned proofreader. The presenter should send a draft of the presentation and discussion questions to the proofreader at least 48 hours in advance of the presentation, and the proofreader should give some feedback at least 24 hours in advance.

Paper discussion participation (10%). Students are expected to participate in class discussions. This includes asking questions during presentations as well as voicing opinions on discussion topics.

Final project (60% total). Students must complete a final research project on a topic related to the class. Projects may be conducted individually or in groups of two. This project is expected to include novel research on either evaluation methodology for identifying problems with models related to robustness, generalization, or fairness, or modeling innovations for improving robustness, generalization, fairness, or other related aspects of model behavior. Please come to office hours or email me if you have questions related to choosing a project direction.

Final project

The final project is worth 60% of the total grade. Points will be allocated as follows:

Project proposal (5%). Students should submit a ~2-page (minimum) proposal for their project by the end of Week 5 (February 11). The proposal should describe the goal of the project and include a survey of related work. When reading these proposals, I will be looking for the following:

Do you have a clear plan for your final project?
- Clearly state your problem statement or goal, and how this relates to previously studied problems in the literature.
- Describe an idea for your method. It does not need to be guaranteed to work, but it should come with a clear plan of how you would carry this out.
- Describe what resources you will need (compute, data, models) and whether you have access to these.
- State why this project is relevant to the course themes (broadly construed), if this is not obvious.
Is there a reasonable chance this plan will succeed?
- Summarize what is known in the literature about this problem and about methods like the one you’ve proposed, and use this to argue that your method makes sense for this problem.
- Optionally, include results of preliminary experiments (not at all expected, but helpful for judging the likelihood of success).

Project progress report (10%). Students should submit a ~5-page progress report for their project by the end of Week 10 (March 25). This should once again describe the project’s goals (which may have changed since the proposal), initial results, and a concrete plan of what will be done for the final report. While the initial results need not be positive, students are expected to have made non-trivial implementation progress by this point. For parts of the report describing project goals and plans, the expectations are largely the same as for the proposal. In addition, I will be looking for the following:

Why did you choose to do the experiments you did? What hypotheses are you testing?
Technical detail about what experiments were conducted. The level of description should be sufficient for someone else to be able to reproduce your experiments.
Analysis of results. What conclusions can you draw from these results? Or if they are inconclusive, what further experiments are needed so that you can draw some conclusions?

Project final presentation (20%). This will be a 20-30 minute presentation during the last two weeks of class. Students should describe the motivation for their work, relevant background material, and results. I encourage students to present both positive and negative results. There will also be some time for audience questions.

Project final report (25%). Students should submit a ~8-page final report detailing all aspects of their project (due May 6). The report should be structured like a conference paper. Parts of the proposal and progress report may be reused for the final report.

Regarding structure: If you’re not sure what to do, I recommend looking back at some of the ACL/EMNLP papers we read this semester and using one paper’s structure as a template. Broadly speaking, every paper should have an abstract followed by sections for Introduction, Problem Statement, Approach, Experiments, and Discussion. Related Work can go in a couple of places, usually either after the introduction or mixed with Discussion. (My rough rule of thumb: put Related Work after the introduction if there are significant prerequisites to understanding the context of your paper that cannot be adequately summarized in the introduction. Otherwise, mix Related Work with Discussion, as the paper will flow better if it goes directly from Introduction to Problem Statement.) You should end with some sort of conclusion; it can be its own section or just the end of the Discussion, but it should wrap up and provide some forward-looking thoughts.
Similarly, use proper LaTeX formatting. Use the ACL LaTeX template linked below as your guide. One common thing I see is confusion between \citet{} and \citep{}. Use \citet{} whenever the work you are citing is playing the role of a noun in your sentence. If you’re saying something like, “As shown in Pruthi et al. (2019), BERT is vulnerable to adversarial typos,” that should be written with \citet{}. Use \citep{} when the citation is parenthetical, i.e., when the sentence would still read correctly with the citation removed.
It is of course important to report experimental results, but it is equally important to analyze them. What conclusions can be drawn from them? What have we learned by doing these experiments? Don’t expect the reader to infer everything you want them to from your results table—it’s your job to tell the reader what your results mean.
Negative results will not be penalized, but should be accompanied by detailed analysis of why the proposed method did not work as anticipated. For example, did you have an underlying hypothesis about why your method would work? If your method did not work, was it because that hypothesis was not true? What do the negative results teach us about NLP models that you did not anticipate?
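To illustrate the \citet{} versus \citep{} distinction mentioned above, here is a minimal sketch. It assumes the natbib-style citation commands available in the ACL template and a hypothetical BibTeX key, pruthi2019, that you would replace with the key from your own .bib file. The rendered forms in the comments are what an author-year style typically produces.

```latex
% Assumption: "pruthi2019" is a hypothetical BibTeX key defined in your .bib file.

% \citet: the cited work acts as a noun in the sentence.
As shown in \citet{pruthi2019}, BERT is vulnerable to adversarial typos.
% Typically renders as: As shown in Pruthi et al. (2019), BERT is vulnerable ...

% \citep: the citation is parenthetical; the sentence reads fine without it.
BERT is vulnerable to adversarial typos \citep{pruthi2019}.
% Typically renders as: BERT is vulnerable to adversarial typos (Pruthi et al., 2019).
```

A quick check: delete the citation from the sentence. If the sentence no longer parses, the citation was acting as a noun and \citet{} is the right choice.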

All written project-related assignments should use the standard *ACL paper submission template (Log in to Overleaf and go to Menu -> Copy Project). All due dates are 11:59pm Pacific Time on Friday.

Late days

You are given 4 late days to use for the project proposal and progress report (no late days for the final report), to be used in integer amounts and distributed as you see fit. Additional late days will result in a deduction of 10% of the grade on the corresponding assignment per day.

Project resources

Google Colab provides free computational resources, though there are limits (e.g., jobs can only run for 12 hours at a time). See their FAQ for details.