This assignment is a continuation of last week’s assignment. We’ll turn from traditional n-gram based language models to a more advanced form of language modeling using a Recurrent Neural Network. Specifically, we’ll be setting up a character-level recurrent neural network, or char-rnn for short.
Andrej Karpathy, a researcher at OpenAI, has written an excellent blog post about using RNNs for language models, which you should read before beginning this assignment. The title of his blog post is The Unreasonable Effectiveness of Recurrent Neural Networks.
Karpathy shows how char-rnns can be used to generate texts for several fun domains:
In this assignment you will follow PyTorch tutorial code to implement your own char-rnn and then test it on a dataset of your choice. You will also train on our provided training set and submit to the leaderboard.
Here are the materials that you should download for this assignment:
PyTorch is one of the most popular deep learning frameworks in both industry and academia, and learning to use it will be invaluable should you choose a career in deep learning. You will be using PyTorch for this assignment; we ask you to build off a couple of PyTorch tutorials.
Please look at the FAQ section before you start working.
Read through the tutorial here, which builds a char-rnn that classifies baby names by their country of origin. You can also build off the released code here. It is recommended that you reproduce the tutorial’s results on the provided baby-name dataset before moving on.
Modify the tutorial code to instead read from the city names dataset that we used in the previous assignment. The tutorial code problematically used the same text file for both training and evaluation; we learned in class why this is not a good idea. For the city names dataset we provide you separate train and validation sets, as well as a test file for the leaderboard.
All training should be done on the train set and all evaluation (including confusion matrices and accuracy reports) on the validation set. You will need to change the data processing code to get this working. In addition, to handle Unicode, you might need to replace calls to open with calls to codecs.open(filename, "r", encoding="utf-8", errors="ignore").
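For example, loading a data file line by line with this error-tolerant decoding might look like the sketch below (the filename in the comment is only a placeholder; point it at wherever you saved the dataset):

```python
import codecs

def read_lines(filename):
    """Read a UTF-8 file line by line, ignoring characters that fail to decode."""
    with codecs.open(filename, "r", encoding="utf-8", errors="ignore") as f:
        return [line.strip() for line in f if line.strip()]

# e.g. names = read_lines("data/train/af.txt")
```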
Warning: you’ll want to lower the learning rate to 0.002 or less, or you might get NaNs during training.
Attribution: the city names dataset is derived from Maxmind’s dataset.
Experimentation and Analysis
Complete the following analysis on the city names dataset, and include your findings in the report.
Write code to make predictions on the provided test set. The test set has one unlabeled city name per line. Your code should output a file
labels.txt with one two-letter country code per line. Extra credit will be given to the top 5 leaderboard submissions. Here are some ideas for improving your leaderboard performance:
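A minimal sketch of producing the submission file is below. It assumes a `predict` function that maps a city name to a two-letter country code; that function name is illustrative, not part of the provided code, so substitute your own inference routine:

```python
def write_labels(cities, predict, out_path="labels.txt"):
    """Write one predicted two-letter country code per line, aligned with the input order."""
    with open(out_path, "w", encoding="utf-8") as f:
        for city in cities:
            f.write(predict(city) + "\n")
```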
In your report, describe your final model and training parameters.
In this section, you will follow more PyTorch tutorial code in order to reproduce Karpathy’s text generation results. Read through the tutorial here, and then download this IPython notebook to base your own code on.
You will notice that the code is quite similar to that of the classification problem. The biggest difference is in the loss function. For classification, we run the entire sequence through the RNN and then impose a loss only on the final class prediction. For the text generation task, we impose a loss at each step of the RNN on the predicted character. The classes in this second task are the possible characters to predict.
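The difference in the loss can be illustrated with plain negative log-likelihood arithmetic (a toy sketch in standard-library Python, not the actual PyTorch loss code): classification penalizes only the final prediction, while generation sums a loss over every timestep:

```python
import math

# Toy probabilities the model assigns to the correct class/character at each step.
step_probs = [0.9, 0.5, 0.8, 0.7]

# Classification: loss imposed only on the final class prediction.
classification_loss = -math.log(step_probs[-1])

# Generation: loss imposed on the predicted character at every step.
generation_loss = sum(-math.log(p) for p in step_probs)
```

Because every intermediate prediction contributes, the generation loss is strictly larger here than the classification loss for the same sequence.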
Be creative! Pick some dataset that interests you. Here are some ideas:
Include a sample of the text generated by your model, and give a qualitative discussion of the results. Where does it do well? Where does it seem to fail? Report perplexity on a couple validation texts that are similar and different to the training data. Compare your model’s results to that of an n-gram language model.
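For the perplexity numbers, recall that perplexity is the exponentiated average per-character negative log-likelihood. A minimal sketch, assuming `char_probs` holds the probability your model assigned to each character of a validation text:

```python
import math

def perplexity(char_probs):
    """exp of the average negative log-likelihood over the characters."""
    nll = sum(-math.log(p) for p in char_probs)
    return math.exp(nll / len(char_probs))

# A model that assigns probability 0.25 to every character has perplexity 4.
```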
Here are the deliverables that you will need to submit:
labels.txt: predictions for the leaderboard.
Use the command below. Please ensure that your model can be used for inference.
model = CharRNNClassify()
model.load_state_dict(torch.load(PATH))
model.eval()  # to predict
If you are new to the paradigm of computational graphs and functional programming, please have a look at this tutorial before getting started.
jupyter nbconvert --to script notebook.ipynb
The TA’s model, which passed all the testcases, had the following configuration:
Send the model and the input and output tensors to the GPU using .to(device). Refer to the PyTorch docs for further information.
Noisy data is common when data is harvested automatically, as with the cities dataset. The onus is on the data scientist to ensure that their data is clean. However, for this assignment, you are not required to clean the dataset.
Neural Nets and Neural Language Models. Dan Jurafsky and James H. Martin. Speech and Language Processing (3rd edition draft).
The Unreasonable Effectiveness of Recurrent Neural Networks. Andrej Karpathy. Blog post. 2015.
A Neural Probabilistic Language Model (longer JMLR version). Yoshua Bengio, Réjean Ducharme, Pascal Vincent and Christian Jauvin. Journal of Machine Learning Research 2003.