Your term project will be a self-designed multi-week team-based effort. You are welcome to select one of the project ideas that we suggest, or to design your own. Your final project will consist of the following components:
We’re going to split up the work on the term project into several deliverables, each with its own due date. You don’t have to wait to start working on each part of the project. We encourage you to begin work early, so that you have a polished final product.
Here’s a provisional list of the deliverables that you’ll need to submit at the end of the project. Each milestone will also have its own deliverables.
report.pdf: The final version of your write-up, incorporating any additional changes to your revised draft. The report should be written in LaTeX and formatted using the NAACL conference template.
readme.md: A brief description of your task and a description of your code.
data-train/: A directory containing the training data.
data-dev/: A directory containing the development data for local evaluation.
data-test/: A directory containing the test data for leaderboard evaluation.
default: A full implementation of the default system.
baseline: An implementation of your baseline system.
extension-2, …: Full implementations of the extensions, one per group member.
grade-dev: A grading script for local evaluation. This may be a wrapper around a generic grading script.
grade-test: A grading script for leaderboard evaluation. This may be a wrapper around a generic grading script.
For Milestone 1, you’ll need to form a team and come up with 3 project ideas. For each idea you should describe:
The term project is a team exercise. The minimum team size is 4, and the max team size is 6. If you need help finding a team, you can post on this Piazza thread.
You should turn in a PDF that lists your 3 ideas and your team members. If you have a preference for which project you’d like to pursue, you’re welcome to indicate that in your report too.
You should identify what topic you would like to work on. Pretty much any topic in natural language processing is fair game. The first milestone for the term project is to pick 3 topic ideas that your team might be interested in exploring. The course staff will help assess the feasibility of your ideas and will make a recommendation of which of your 3 initial ideas is the best fit for the scope of the term project.
Bias in word vectors - Perhaps surprisingly, there is gender and racial bias in word embeddings. For your term project, you can replicate the findings in Semantics derived automatically from language corpora contain human-like biases by Aylin Caliskan, Joanna Bryson, and Arvind Narayanan (2017) or in Word embeddings quantify 100 years of gender and ethnic stereotypes by Nikhil Garg, Londa Schiebinger, Dan Jurafsky, and James Zou. They show that word vectors trained on web data encode a spectrum of known biases, as measured by the Implicit Association Test. This is largely because people’s biases are expressed in their writing and thus in the data we use to train our embeddings. After you recreate these results, you can see if it is possible to remove the bias by anonymizing names, gendered nouns, and pronouns in the training data. You could then retrain word embeddings (GloVe, word2vec, and/or fastText) and measure whether the bias is still present.
Identifying words to anonymize - Clinical records with protected health information (PHI) cannot be shared as is, due to privacy constraints, which makes it particularly cumbersome to carry out NLP research in the medical domain. A necessary precondition for accessing clinical records outside of hospitals is their de-identification, i.e., the exhaustive removal (or replacement) of all mentioned PHI phrases. To determine how well we are able to identify PHI phrases, a group has prepared the Medical Document Anonymization (MedDocAn) task. The MedDocAn task is run on a synthetic corpus of 1000 clinical case studies. This corpus was selected manually by a practicing physician and augmented with PHI information from discharge summaries and medical genetics clinical records. The challenge for this project will be to perform entity recognition on the data and detect sensitive spans. Find information at the MedDocAn task page.
Identifying claims and perspectives that support or refute them - There are many ways to respond to a claim such as “animals should have lawful rights”, and these responses form a spectrum of perspectives, each with a stance relative to this claim and, ideally, with evidence supporting it. Inherently, this is a natural language understanding task. You can address the task of substantiated perspective discovery where, given a claim, a system is expected to discover a diverse set of well-corroborated perspectives that take a stance with respect to the claim. Each perspective should be substantiated by evidence paragraphs which summarize pertinent results and facts. We recently created PERSPECTRUM, a dataset of claims, perspectives and evidence, making use of online debate websites to create the initial data collection, and augmenting it using search engines in order to expand and diversify our dataset.
Training embeddings with different types of contexts - word2vec and GloVe train word embeddings using local context information, like a small window surrounding each word. You could implement software for training word embeddings from different types of contexts, like widening the narrow context windows to complete documents, or using contexts like those in Dependency-Based Word Embeddings by Omer Levy and Yoav Goldberg. Perform a systematic analysis of how different contexts change the learned embeddings, using the evaluation methodology outlined in Evaluation of Word Vector Representations by Subspace Alignment by Yulia Tsvetkov et al.
Create cross-lingual embeddings - experiment with various methods for creating cross-lingual word embeddings and evaluate how good each method is at learning missing entries in a bilingual dictionary. Here’s a set of 100 bilingual dictionaries that you can use.
Commonsense inference with BERT and SWAG - BERT has been shown to be very effective at many language understanding tasks. For your project you can evaluate how well BERT solves Grounded Commonsense Inference in the new SWAG data set. There’s a GitHub repo with BERT as a service that should allow you to get up and running quickly. As an extension for this project, you can try to replicate SWAG’s adversarial dataset collection methodology, but with BERT as the model.
Order prenominal modifiers - In English, prenominal modifiers must come in a certain order. It sounds fluent to say the big beautiful white wooden house, but not the white wooden beautiful big house. Here’s a good NLP paper describing a class-based approach to ordering prenominal modifiers. You could collect all of the prenominal modifiers from a large parsed corpus like the WaCKy corpora or the Annotated Gigaword, and then train a model to predict their order. Here’s a rule from a grammar book about what order adjectives are supposed to come in. Is it true?
Things native English speakers know, but don't know we know: pic.twitter.com/Ex0Ui9oBSL (Matthew Anderson, @MattAndersonNYT, September 3, 2016)
In addition to these ideas, you can check out the numerous shared tasks that are run by the NLP community. Shared tasks are a good fit for the term project, because they provide shared data, establish evaluation metrics, and there will be several publications describing how researchers approached the tasks.
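For the word-vector bias idea above, Caliskan et al. quantify bias with the Word Embedding Association Test (WEAT), an embedding analogue of the Implicit Association Test. The following is a minimal sketch of the WEAT effect size, not the authors' code. The function names are hypothetical, vectors are plain lists of floats, and the use of the sample standard deviation is an assumption (implementations differ on this choice).

```python
import math
from statistics import mean, stdev


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def association(w, attrs_a, attrs_b):
    """s(w, A, B): mean similarity of w to attribute set A minus that to B."""
    sim_a = [cosine(w, a) for a in attrs_a]
    sim_b = [cosine(w, b) for b in attrs_b]
    return mean(sim_a) - mean(sim_b)


def weat_effect_size(targets_x, targets_y, attrs_a, attrs_b):
    """Difference in mean association of the two target sets, divided by the
    standard deviation of the association over all target words.
    Assumption: sample standard deviation (statistics.stdev)."""
    s_x = [association(x, attrs_a, attrs_b) for x in targets_x]
    s_y = [association(y, attrs_a, attrs_b) for y in targets_y]
    return (mean(s_x) - mean(s_y)) / stdev(s_x + s_y)


if __name__ == "__main__":
    # Toy 2-d vectors, purely illustrative; real experiments would load
    # trained embeddings for the target and attribute word lists.
    X = [[1.0, 0.0], [0.9, 0.1]]
    Y = [[0.0, 1.0], [0.1, 0.9]]
    A = [[1.0, 0.0]]
    B = [[0.0, 1.0]]
    print(weat_effect_size(X, Y, A, B))
```

A positive value means the first target set is more strongly associated with the first attribute set than the second target set is. Swapping the two target sets flips the sign. The sketch does not handle zero vectors or out-of-vocabulary words.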
The course staff will review the 3 ideas that you submitted for Milestone 1, and make a recommendation on which of your ideas you ought to pursue. For Milestone 2, your job is to get started on that idea with three steps:
We have also assigned a course staff member to be your mentor. You should sign up for a one-on-one meeting with your mentor to talk through your project.
Since most of the projects that we do in this course are data-driven, it’s very important to have your data ready to go at the outset of a project. You should collect all of the data that you’ll need for your term project and split the data into three pieces:
The training data will be used to train the model. The dev data can be used to optimize your system parameters and/or to evaluate different approaches to the problem. The test data is a “blind” test set that will be used in the final evaluation.
If you are basing your term project on a shared task, then usually the data will be collected already, and usually it will be divided into a standard training/dev/test split. If it’s already assembled and split - great! You’re ahead of the game. If you’re not doing a shared task, then you may need to assemble your own data. A good way of creating your own training/dev/test split is to divide the data into chunks that are sized around 80%/10%/10%, where you want to use most of the data for training. It’s important to ensure that the same items don’t appear in more than one of the splits.
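If you need to create your own split, something like the following sketch may help. It assumes that each item is a hashable value (for example, a string or a tuple) and that exact duplicates are the only kind of overlap you need to guard against. The function name `split_dataset` and its parameters are hypothetical.

```python
import random


def split_dataset(items, train_frac=0.8, dev_frac=0.1, seed=0):
    """Shuffle the unique items and split them into train/dev/test lists."""
    # Drop exact duplicates first so the same item cannot land in two splits.
    unique_items = list(dict.fromkeys(items))
    # A fixed seed makes the split reproducible across runs.
    rng = random.Random(seed)
    rng.shuffle(unique_items)
    n_train = int(len(unique_items) * train_frac)
    n_dev = int(len(unique_items) * dev_frac)
    train = unique_items[:n_train]
    dev = unique_items[n_train:n_train + n_dev]
    # Everything left over becomes the test set.
    test = unique_items[n_train + n_dev:]
    return train, dev, test


if __name__ == "__main__":
    # 100 distinct toy items plus one duplicate, purely for illustration.
    examples = [f"example-{i}" for i in range(100)] + ["example-0"]
    train, dev, test = split_dataset(examples)
    print(len(train), len(dev), len(test))
```

If your data contains near-duplicates or groups of related items (for example, several sentences from the same document), you may want to split by group rather than by individual item, so that related items stay in the same split.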
For your M2 deliverables, we’ll ask you to submit your data, plus a markdown file named data.md that describes the format of the data. If your data is very large, then you can submit a sample of the data and give a link to a Google Drive folder that contains the full data set. Your data.md should describe the number of items in each of your training/dev/test splits.
For the next part of M2, you’ll need to determine a suitable evaluation metric for your task, and implement it. If you’re basing your term project on a shared task, then there is likely an established evaluation metric for the task. You should re-use it. If you’re doing a new task, then you may have to do a literature review in order to determine what metrics are best suited for your task.
You should write an evaluation script that takes two things as input: a system’s output and a corresponding set of gold standard answers. Your script should output a number that quantifies how good the system’s answers are.
For your deliverables, you should include your script, plus an example of how to run it from the command line. You should give a formal definition of the evaluation metric that explains how it is calculated in a markdown file called scoring.md - this file should cite any relevant papers that introduce the metric. You can also cite Wikipedia articles that describe your evaluation metric, and/or link to an overview paper describing the shared task that you’re basing your project on if it defines the metric.
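As an illustration of the shape such a script might take, here is a minimal sketch that computes accuracy. Accuracy is only a stand-in: you should use whatever metric is established for your task. The file name `score.py` and the one-label-per-line file format are assumptions made for this example.

```python
import sys


def read_labels(path):
    """Read one label per line, skipping blank lines (assumed file format)."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


def accuracy(gold, predicted):
    """Fraction of items where the predicted label equals the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot score an empty set of answers")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


if __name__ == "__main__":
    if len(sys.argv) == 3:
        # First argument: system output. Second argument: gold standard.
        predicted = read_labels(sys.argv[1])
        gold = read_labels(sys.argv[2])
        print(f"{accuracy(gold, predicted):.4f}")
    else:
        print("usage: python score.py SYSTEM_OUTPUT GOLD_STANDARD")
```

Under these assumptions, you would run it from the command line as `python score.py system-output.txt gold.txt`, where both file names are placeholders.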
As the final part of M2, you should write a simple baseline. This should be the simplest way of producing output for your task. For example, it could be a majority class baseline (like the one that we used in HW1) that determines the majority class from the training data and guesses that class for each item in the test set.
You should write a Python program that generates the output for the baseline, and submit it as simple-baseline.py. You should also include a markdown file named simple-baseline.md that describes your simple baseline, gives sample output, and reports the score of the baseline when you run it on the test set and evaluate it with your scoring script.
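For a classification task, the majority class baseline mentioned above could be sketched as follows. This is an illustrative sketch rather than the HW1 code. The function names are hypothetical, and labels are assumed to be hashable values such as strings.

```python
from collections import Counter


def majority_class(train_labels):
    """Return the most frequent label in the training data.
    If several labels are tied, the one seen first is returned."""
    return Counter(train_labels).most_common(1)[0][0]


def predict_majority(train_labels, n_test_items):
    """Guess the majority training label for every test item."""
    label = majority_class(train_labels)
    return [label] * n_test_items


if __name__ == "__main__":
    # Toy labels, purely for illustration.
    train_labels = ["pos", "neg", "pos", "pos", "neg"]
    for prediction in predict_majority(train_labels, n_test_items=3):
        print(prediction)
```

Note that the baseline looks only at the training labels and never at the test inputs, which is what makes it a useful lower bound for any system that does use the inputs.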
The goals of Milestone 3 are to do a literature review to determine the approaches that other researchers took to solve your problem, and to implement a published system to establish a strong baseline for your project.
For your literature review, you should read 3-5 research papers that address the problem that you are working on. You should write a 1-2 paragraph summary of each paper, describing the approaches that they proposed and what results they got. You should also include an additional 1-2 paragraphs saying which of the approaches you selected as the published baseline that you are re-implementing. You should submit your literature review in a markdown-formatted file called lit-review.md.
You should re-implement the published baseline that you selected. It’s fine to use machine learning packages like sklearn, or NLP software like spaCy or NLTK, but you should implement the main algorithms yourself. You should not turn in existing code that implements the baseline.
You should include a baseline.md markdown file that includes step-by-step instructions on how to run your baseline code. Your baseline.md should also report the score for your system on your test and development data, and compare that to your simple baseline from Milestone 2.
For Milestone 3, you will also prepare a presentation for your project. Your in-class presentation should be 12 minutes long. You should create a slide deck with Google Slides. Your presentation should convey these main ideas:
You may also want to cover topics like this:
For Milestone 4, you’ll need to implement several extensions beyond this published baseline. These should be different experiments that you run to try to improve its performance. The number of extensions that you’ll implement depends on the number of members in your group. If you have 4 team members, you should implement 2 extensions. If you have 5, then 3 extensions. If you have 6, then 4 extensions.
For your final milestone, you’ll complete your extensions to the baseline, and you’ll produce a final writeup for your term project. As a reminder, the number of extensions that you must submit depends on your group size. If you have 4 team members, you should implement 2 extensions. If you have 5, then 3 extensions. If you have 6, then 4 extensions.
Your final report should be written in the style of a scientific paper, and formatted with this LaTeX style file (which will make it look totally scientific!). Your report should contain the following sections:
I really like examples and good illustrations. If you created some nice visuals for your final presentation slides, then I encourage you to include them in your writeup too. You can submit your images in an images/ subfolder.
You should turn in the following items:
You’ve reached the end. Great job!