The ability to automatically infer the semantic similarity between two pieces of text is of central importance in natural language processing. While semantic similarity at the lexical level, i.e., between words, has been widely studied (e.g., WordNet, word vectors), inferring semantic similarity for longer pieces of text remains fairly under-researched.
In this assignment, we will try to design a machine learning model for measuring the semantic similarity between two given English sentences. Along with understanding lexical-level similarity, this also requires modeling how syntactic composition affects the semantics of a sentence: two sentences can share few words yet mean the same thing, or be nearly identical at the lexical level yet differ sharply in meaning. Consider the following examples:
Sentence 1: Birdie is washing itself in the water basin.
Sentence 2: The bird is bathing in the sink.
The two sentences have very high semantic similarity, even though the lexical match is very low.
Sentence 1: The young lady enjoys listening to the guitar.
Sentence 2: The young lady enjoys playing the guitar.
The two sentences have very low semantic similarity, even though the lexical match is very high.
Here are the materials that you should download for this assignment:
data.zip: Contains the data for STS-2017.
evaluate.py: Contains the evaluation script for the assignment.

In this assignment, we will make use of the data provided in the Semantic Textual Similarity (STS) 2017 shared task.
Similarity Measure: The semantic similarity in this dataset is measured on a scale of 0-5 where 0 indicates that the semantics of the sentences are completely independent and 5 signifies semantic equivalence.
Data: We provide training/validation/test splits for the data containing 13365/1500/250 samples, respectively.
Evaluation Metric: Performance is assessed by computing the Pearson correlation between machine-assigned semantic similarity scores and human judgements, as sketched below.
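For reference, here is a minimal sketch of the metric using scipy, assuming the gold and predicted scores are held in two equal-length lists; the provided evaluate.py is the authoritative implementation for grading, so treat this only as an illustration.

    # Minimal sketch of the evaluation metric; the score lists below are
    # hypothetical placeholders, not taken from the actual data.
    from scipy.stats import pearsonr

    gold = [4.2, 0.5, 3.0, 5.0]  # hypothetical human judgements
    pred = [3.8, 1.1, 2.7, 4.6]  # hypothetical model outputs

    r, _ = pearsonr(gold, pred)  # returns (correlation, p-value)
    print(f"Pearson correlation: {r:.3f}")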
In addition to the provided data, you are free to use any supervised or unsupervised data. One possible resource is the wealth of data released by STS shared tasks over at least the last five years.
As suggested by models submitted to the STS 2017 shared task, a simple unsupervised baseline is the cosine similarity between the bag-of-words representations of the two sentences. You need to implement this baseline as a sanity check; it should reach a Pearson correlation of about 0.63 on the test set.
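A minimal sketch of this baseline follows, assuming the sentence pairs are available as Python string tuples; the vectorizer settings and the rescaling to 0-5 are illustrative choices (Pearson correlation is itself invariant to linear rescaling, so the scaling does not affect the reported score).

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def bow_cosine(sent1, sent2, vectorizer):
        """Cosine similarity between the bag-of-words vectors of two sentences."""
        v1 = vectorizer.transform([sent1])
        v2 = vectorizer.transform([sent2])
        return cosine_similarity(v1, v2)[0, 0]

    pairs = [
        ("Birdie is washing itself in the water basin.",
         "The bird is bathing in the sink."),
    ]

    # Fit the vocabulary on every sentence in the data.
    vectorizer = CountVectorizer().fit([s for pair in pairs for s in pair])

    # Cosine similarity of count vectors lies in [0, 1]; scale to the 0-5 range.
    scores = [5 * bow_cosine(s1, s2, vectorizer) for s1, s2 in pairs]
    print(scores)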
For the next part of the assignment, you need to develop two models for predicting STS. A team that previously did this as a course project implemented the model from DT_Team and reported a Pearson correlation of ~0.69; their best model achieved a score of 0.81.
You can implement the same model, or come up with new ideas to improve the performance.
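As one illustrative starting point (an assumption for this sketch, not the DT_Team model), a supervised approach could fit a simple regressor on hand-crafted pair features against the gold scores:

    import numpy as np
    from sklearn.linear_model import Ridge

    def pair_features(s1, s2):
        """Toy features for a sentence pair: Jaccard token overlap and length ratio."""
        t1, t2 = set(s1.lower().split()), set(s2.lower().split())
        overlap = len(t1 & t2) / max(len(t1 | t2), 1)
        len_ratio = min(len(t1), len(t2)) / max(len(t1), len(t2), 1)
        return [overlap, len_ratio]

    # Hypothetical training pairs with gold 0-5 scores; in practice these
    # come from the provided training split.
    train_pairs = [
        ("The bird is bathing in the sink.",
         "Birdie is washing itself in the water basin."),
        ("The young lady enjoys listening to the guitar.",
         "The young lady enjoys playing the guitar."),
    ]
    train_scores = [4.5, 1.0]  # hypothetical gold labels

    X = np.array([pair_features(s1, s2) for s1, s2 in train_pairs])
    model = Ridge(alpha=1.0).fit(X, train_scores)
    predictions = model.predict(X)  # at test time, featurize test pairs instead

Richer features (word alignments, embeddings, WordNet relations) or neural sentence encoders are natural directions beyond this toy setup.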
You need to submit two things:
writeup.pdf: explaining in detail the models implemented
code.zip: containing the code with a comprehensible README

Here is a list of numerous submissions to the STS task. Feel free to borrow ideas from them.