In this assignment, we will examine some advanced uses of vector representations of words. We are going to look at two different problems:
In order to use the gensim package, you’ll have to be using Python version 3.6 or higher. On my Mac, I did the following:
brew install python3
pip3 install gensim
python3instead of just
Here are the materials that you should download for this assignment:
question1.txtA template for answering question 1.
data.zipContains all the data
vectorcluster.pyMain code stub
makecooccurrences.pyScript to make cooccurrences (optional use)
Word2vec is a very cool word embedding method that was developed by Thomas Mikolov and his collaborators. One of the noteworthy things about the method is that it can be used to solve word analogy problems like man is to king as woman is to [blank] The way that it works is to perform vector math. They take the vectors representing king, man and woman and perform some vector arithmetic to produce a vector that is close to the expected answer. $king−man+woman \approx queen$ So We can find the nearest a vector in the vocabulary by looking for $argmax \ cos(x, king-man+woman)$. Omar Levy has a nice explnation of the method in this Quora post, and in his paper Linguistic Regularities in Sparse and Explicit Word Representations.
In addition to solving this sort of analogy problem, the same sort of vector arithmetic was used with word2vec embeddings to find relationships between pairs of words like the following:
In the first part of this homework, you will play around with the gensim library library. You will use
gensim load a dense vector model trained using
word2vec, and use it to manipulate and analyze the vectors.
You can start by experimenting on your own, or reading through this tutorial on using word2vec with gensim. You should familiarize yourself with the KeyedVectors documentation.
The questions below are designed to familiarize you with the
gensim Word2Vec package, and get you thinking about what type of semantic information word embeddings can encode. You’ll submit your answers to these questions when you submit your other homework materials.
Load the word vectors using the following Python commands:
picnicitself)? (Use the function
['tissue', 'papyrus', 'manila', 'newsprint', 'parchment', 'gazette'](Use the function
We have provided a file called
question1.txt for you to submit answers to the questions above.
Many natural language processing (NLP) tasks require knowing the sense of polysemous words, which are words with multiple meanings. For example, the word bug can mean
In past research my PhD students and I have looked into automatically deriving the different meaning of polysemous words like bug by clustering their paraphrases. We have developed a resource called the paraphrase database (PPDB) that contains of paraphrases for tens of millions words and phrases. For the target word bug, we have an unordered list of paraphrases including: insect, glitch, beetle, error, microbe, wire, cockroach, malfunction, microphone, mosquito, virus, tracker, pest, informer, snitch, parasite, bacterium, fault, mistake, failure and many others. We used automatic clustering group those into sets like:
These clusters approximate the different word senses of bug. You will explore the main idea underlying our word sense clustering method: which measure the similarity between each pair of paraphrases for a target word and then group together the paraphrases that are most similar to each other. This affinity matrix gives an example of one of the methods for measuring similarity that we tried in our paper:
Here the darkness values give an indication of how similar paraprhases are to each other. For instance sim(insect, pest) > sim(insect, error).
In this assignment, we will use vector representations in order to measure their similarities of pairs of paraprhases. You will play with different vector space representations of words to create clusters of word senses.
In this image, we have a target word “bug”, and a list of all synonyms (taken from WordNet). The 4 circles are the 4 senses of “bug.” The input to the problem is all the synonyms in a single list, and the task is to separate them correctly. As humans, this is pretty intuitive, but computers aren’t that smart. We will use this task to explore different types of word representations.
We expect that you have read Jurafsky and Martin, chapters 15 and 16. Word vectors, also known as word embeddings, can be thought of simply as points in some high-dimensional space. Remember in geometry class when you learned about the Euclidean plane, and 2-dimensional points in that plane? It’s not hard to understand distance between those points – you can even measure it with a ruler. Then you learned about 3-dimensional points, and how to calculate the distance between these. These 3-dimensional points can be thought of as positions in physical space.
Now, do your best to stop thinking about physical space, and generalize this idea in your mind: you can calculate a distance between 2-dimensional and 3-dimensional points, now imagine a point with 300 dimensions. The dimensions don’t necessarily have meaning in the same way as the X,Y, and Z dimensions in physical space, but we can calculate distances all the same.
This is how we will use word vectors in this assignment: as points in some high-dimensional space, where distances between points are meaningful. The interpretation of distance between word vectors depends entirely on how they were made, but for our purposes, we will consider distance to measure semantic similarity. Word vectors that are close together should have meanings that are similar.
With this framework, we can see how to solve our synonym clustering problem. Imagine in the image below that each point is a (2-dimensional) word vector. Using the distance between points, we can separate them into 3 clusters. This is our task.
(Image taken from Wikipedia)
The data to be used for this assignment consists of sets of paraphrases corresponding to one of 56 polysemous target words, e.g.
|note.v||comment mark tell observe state notice say remark mention|
|hot.a||raging spicy blistering red-hot live|
.v following the target
note indicates the part of speech.)
Your objective is to automatically cluster each paraphrase set such that each cluster contains words pertaining to a single sense, or meaning, of the target word. Note that a single word from the paraphrase set might belong to one or more clusters.
For evaluation, we take the set of ground truth senses from WordNet.
The development data consists of two files – a words file (the input), and a clusters file (to evaluate your output). The words file
dev_input.txt is formatted such that each line contains one target, its paraphrase set, and the number of ground truth clusters k, separated by a
target.pos :: k :: paraphrase1 paraphrase2 paraphrase3 ...
You can use k as input to your clustering algorithm.
The clusters file
dev_output.txt contains the ground truth clusters for each target word’s paraphrase set, split over k lines:
target.pos :: 1 :: paraphrase2 paraphrase6 target.pos :: 2 :: paraphrase3 paraphrase4 paraphrase5 ... target.pos :: k :: paraphrase1 paraphrase9
For testing, you will receive only words file
test_input.txt containing the test target words and their paraphrase sets. Your job is to create an output file, formatted in the same way as
dev_output.txt, containing the clusters produced by your system. Neither order of senses, nor order of words in a cluster matter.
There are many possible ways to evaluate clustering solutions. For this homework we will rely on the paired F-score, which you can read more about in this paper.
The general idea behind paired F-score is to treat clustering prediction like a classification problem; given a target word and its paraphrase set, we call a positive instance any pair of paraphrases that appear together in a ground-truth cluster. Once we predict a clustering solution for the paraphrase set, we similarly generate the set of word pairs such that both words in the pair appear in the same predicted cluster. We can then evaluate our set of predicted pairs against the ground truth pairs using precision, recall, and F-score.
We have provided an evaluation script that you can use when developing your own system. You can run it as follows:
python evaluate.py <GROUND-TRUTH-FILE> <PREDICTED-CLUSTERS-FILE>
On the dev data, a random baseline gets about 20%, the word cooccurrence matrix gets about 36%, and the word2vec vectors get about 30%.
Your next task is to generate clusters for the target words in
test_input.txt based on a feature-based (not dense) vector space representation. In this type of VSM, each dimension of the vector space corresponds to a specific feature, such as a context word (see, for example, the term-context matrix described in Chapter 15.1.2 of Jurafsky & Martin).
You will calculate cooccurrence vectors on the Reuters RCV1 corpus. Download a tokenized and cleaned version here. The original is here. Use the provided script,
makecooccurrences.py, to build these vectors. Be sure to set D and W to what you want.
It can take a long time to build cooccurrence vectors, so we have pre-built a set, included in the data.zip, called
coocvec-500mostfreq-window-3.vec.filter. To save on space, these include only the words used in the given files.
You will add K-means clustering to
vectorcluster.py. Here is an example of the K-means code:
The baseline system for this section represents words using a term-context matrix
M of size
|V| x D, where
|V| is the size of the vocabulary and D=500. Each feature corresponds to one of the top 500 most-frequent words in the corpus. The value of matrix entry
M[i][j] gives the number of times the context word represented by column
j appeared within W=3 words to the left or right of the word represented by row
i in the corpus. Using this representation, the baseline system clusters each paraphrase set using K-means.
While experimenting, write out clusters for the dev input to
dev_output_features.txt and use the
evaluate.py script to compare against the provided
Implementing the baseline will score you a B, but why not try and see if you can do better? You might try experimenting with different features, for example:
Din the baseline implementation?
Wused to extract contexts?
The only feature types that are off-limits are WordNet features.
Turn in the predicted clusters that your VSM generates in the file
test_output_features.txt. Also provide a brief description of your method in
writeup.pdf, making sure to describe the vector space model you chose, the clustering algorithm you used, and the results of any preliminary experiments you might have run on the dev set. We have provided a LaTeX file shell,
writeup.tex, which you can use to guide your writeup.
Finally, we’d like to see if dense word embeddings are better for clustering the words in our test set. Run the word clustering task again, but this time use a dense word representation.
For this task, use files:
GoogleNews-vectors-negative300.filter, which is filtered to contain only the words in the dev/test splits.
vectorcluster.pyto load dense vectors.
The baseline system for this section uses the provided word vectors to represent words, and K-means for clustering.
As before, achieving the baseline score will get you a B, but you might try to see if you can do better. Here are some ideas:
gensim.models.Word2Vecpackage for the skip-gram or CBOW models, or GLOVE. Try experimenting with the dimensionality.
As in question 2, turn in the predicted clusters that your dense vector representation generates in the file
test_output_dense.txt. Also provide a brief description of your method in
writeup.pdf that includes the vectors you used, and any experimental results you have from running your model on the dev set.
In addition, do an analysis of different errors made by each system – i.e. look at instances that the word-context matrix representation gets wrong and dense gets right, and vice versa, and see if there are any interesting patterns. There is no right answer for this.
In order to stir up some friendly competition, we would also like you to submit the clustering from your best model to a leaderboard. Copy the output file from your best model to a file called
test_output_leaderboard.txt, and include it with your submission.
We made the clustering problem deliberately easier by providing you with
k, the number of clusters, as an input. But in most clustering situations the best
k isn’t obvious.
To take this assignment one step further, see if you can come up with a way to automatically choose
k. We have provided an additional test set,
test_nok_input.txt, where the
k field has been zeroed out. See if you can come up with a method that clusters words by sense, and chooses the best
k on its own. (Don’t look at the number of WordNet synsets for this, as that would ruin all the fun.) The baseline system for this portion always chooses
You can submit your output to this part in a file called
test_nok_output_leaderboard.txt. Be sure to describe your method in
Here are the deliverables that you will need to submit:
question1.txtfile with answers to questions from Exploration
test_output_leaderboard.txt(this will probably be a copy of either
test_nok_output_leaderboard.txt(submit this to the Gradescope assignment ‘Homework 4 EXTRA CREDIT’)
|Vector Semantics. Dan Jurafsky and James H. Martin. Speech and Language Processing (3rd edition draft) .|
|Semantics with Dense Vectors. Dan Jurafsky and James H. Martin. Speech and Language Processing (3rd edition draft) .|
Efficient Estimation of Word Representations in Vector Space.
Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean.
Linguistic Regularities in Continuous Space Word Representations.
Tomas Mikolov, Wen-tau Yih, Geoffrey Zweig.
Discovering Word Senses from Text.
Patrick Pangel and Dekang Ling.
Clustering Paraphrases by Word Sense.
Anne Cocos and Chris Callison-Burch.
60 points total. Example baseline implementations are available here.