Since you have read Jurafsky and Martin chapter 21, you know that Named Entity Recognition is the task of finding and classifying named entities in text. This task is often treated as a sequence tagging task, like part-of-speech tagging, where words form a sequence through time and each word is given a tag. Unlike part-of-speech tagging, however, NER usually uses a relatively small number of tags, and the vast majority of words are tagged with the ‘non-entity’ tag, or O tag.
Your task is to implement your own named entity recognizer. Relax, you’ll find it’s a lot easier than it sounds, and it should be very satisfying to accomplish this. There will be two versions of this task. The first, the constrained version, is the required entity tagger that you implement using scikit-learn, filling out the stub that we give you. The second is an unconstrained, optional version where you use whatever tool, technique, or feature you can get your hands on to get the best possible score on the dataset. There will be a leaderboard for each version.
As with nearly all NLP tasks, you will find that the two big points of variability in NER are (a) the features, and (b) the learning algorithm, with the features arguably being the more important of the two. The point of this assignment is for you to think about and experiment with both of these. Are there interesting features you can use? What latent signal might be important for NER? What have you learned in the class so far that can be brought to bear?
Get a head start on common NER features by looking at Figure 21.5 in the textbook.
Here are the materials that you should download for this assignment:
The data we use comes from the Conference on Natural Language Learning (CoNLL) 2002 shared task on named entity recognition for Spanish and Dutch. The introductory paper to the shared task will be of immense help to you, and you should definitely read it. You may also find the original shared task page helpful. We will use the Spanish corpus (although you are welcome to try out Dutch too).
The tagset is:
The data uses BIO encoding (called IOB in the textbook), which means that each named entity tag is prefixed with a B-, which means beginning, or an I-, which means inside. So, for a multiword entity, like “James Earle Jones”, the first token “James” would be tagged with “B-PER”, and each subsequent token is “I-PER”. The O tag is for non-entities.
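To make the encoding concrete, here is a minimal sketch that recovers entity spans from a BIO tag sequence. The function name `bio_to_spans` is hypothetical (it is not part of the stub), and treating a stray I- tag as the start of a new entity is an assumption made for this example.

```python
def bio_to_spans(tags):
    """Return (entity_type, start, end) spans from BIO tags; end is exclusive."""
    spans = []
    start, etype = None, None
    for i, tag in enumerate(tags):
        # An I- tag continues the open span only if the entity type matches.
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            spans.append((etype, start, i))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
        elif tag.startswith("I-") and start is None:
            # Assumption: a stray I- (no preceding B-) opens a new span.
            start, etype = i, tag[2:]
    if start is not None:
        spans.append((etype, start, len(tags)))
    return spans


tokens = ["James", "Earle", "Jones", "spoke"]
tags = ["B-PER", "I-PER", "I-PER", "O"]
for etype, start, end in bio_to_spans(tags):
    print(etype, " ".join(tokens[start:end]))
```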
We strongly recommend that you study the training and dev data (no one’s going to stop you from examining the test data, but for the integrity of your model, it’s best to not look at it). Are there idiosyncrasies in the data? Are there patterns you can exploit with cool features? Are there obvious signals that identify names? For example, in some Turkish writing, there is a tradition of putting an apostrophe between a named entity and the morphology attached to it. A feature of isApostrophePresent() goes a long way. Of course, in English and several other languages, capitalization is a hugely important feature. In some African languages, there are certain words that always precede city names.
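As a rough illustration of surface cues like these, the sketch below builds a feature dictionary for one token. The function name `token_features` and every feature name in it are hypothetical choices for this example, not the stub's; a list of such dictionaries could plausibly be turned into a feature matrix with scikit-learn's `DictVectorizer`.

```python
def token_features(tokens, i):
    """Build a dict of simple surface features for tokens[i]."""
    word = tokens[i]
    feats = {
        "bias": 1.0,
        "lower": word.lower(),
        "is_capitalized": word[:1].isupper(),
        "is_all_caps": word.isupper(),
        "has_digit": any(ch.isdigit() for ch in word),
        # Covers both the straight and the curly apostrophe.
        "has_apostrophe": "'" in word or "\u2019" in word,
        "suffix3": word[-3:].lower(),
    }
    # Neighbouring words often signal an entity (titles, prepositions, ...).
    if i > 0:
        feats["prev_lower"] = tokens[i - 1].lower()
    else:
        feats["BOS"] = True  # beginning of sentence
    if i < len(tokens) - 1:
        feats["next_lower"] = tokens[i + 1].lower()
    else:
        feats["EOS"] = True  # end of sentence
    return feats


sentence = ["La", "Coruña", ",", "23", "may", "(", "EFECOM", ")", "."]
features = [token_features(sentence, i) for i in range(len(sentence))]
print(features[1])
```

Returning a dictionary keeps each feature easy to add, remove, or ablate while you experiment.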
You will be glad to hear that the data is a mercifully small download. See the NLTK data page for download options, but one way to get the conll2002 data is:
$ python -m nltk.downloader conll2002
There are two common ways of evaluating NER systems: phrase-based, and token-based. In phrase-based, the more common of the two, a system must predict the entire span correctly for each name. For example, say we have text containing “James Earle Jones”, and our system predicts “[PER James Earle] Jones”. Phrase-based gives no credit for this because it missed “Jones”, whereas token-based would give partial credit for correctly identifying “James” and “Earle” as B-PER and I-PER respectively. We will use phrase-based to report scores.
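The difference between the two schemes can be sketched on the example above. This is a simplified illustration, not a reimplementation of conlleval: the helper names are hypothetical, and "token-based" is reduced here to plain per-token accuracy.

```python
def spans(tags):
    """Return the set of (entity_type, start, end) spans in a BIO sequence."""
    out, start, etype = set(), None, None
    # A trailing "O" sentinel closes any span still open at the end.
    for i, tag in enumerate(list(tags) + ["O"]):
        inside = tag.startswith("I-") and tag[2:] == etype
        if start is not None and not inside:
            out.add((etype, start, i))
            start, etype = None, None
        if tag != "O" and start is None:
            start, etype = i, tag[2:]
    return out


def phrase_f1(gold, pred):
    """F1 over whole spans: a span counts only if type and boundaries match."""
    g, p = spans(gold), spans(pred)
    correct = len(g & p)
    precision = correct / len(p) if p else 0.0
    recall = correct / len(g) if g else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def token_accuracy(gold, pred):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


gold = ["B-PER", "I-PER", "I-PER"]  # James Earle Jones
pred = ["B-PER", "I-PER", "O"]      # [PER James Earle] Jones
print("phrase F1:", phrase_f1(gold, pred))
print("token accuracy:", token_accuracy(gold, pred))
```

On this example the phrase-level score is zero, while two of the three tokens still match.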
The output of your code must be word gold pred, as in:
La B-LOC B-LOC
Coruña I-LOC I-LOC
, O O
23 O O
may O O
( O O
EFECOM B-ORG B-ORG
) O O
. O O
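One way to produce lines in this shape is sketched below. The function name `format_results` is hypothetical, and the blank line after each sentence is an assumption about what the scoring script expects, so check it against the stub before relying on it.

```python
def format_results(sentences):
    """Format sentences of (word, gold, pred) triples, one token per line."""
    lines = []
    for sentence in sentences:
        for word, gold, pred in sentence:
            lines.append(f"{word} {gold} {pred}")
        lines.append("")  # assumed: blank line marks the sentence boundary
    return "\n".join(lines)


example = [[
    ("La", "B-LOC", "B-LOC"),
    ("Coruña", "I-LOC", "I-LOC"),
    (",", "O", "O"),
    ("23", "O", "O"),
    ("may", "O", "O"),
    ("(", "O", "O"),
    ("EFECOM", "B-ORG", "B-ORG"),
    (")", "O", "O"),
    (".", "O", "O"),
]]
output = format_results(example)
print(output)
```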
Here’s how to get scores (assuming the above format is in a file called results.txt):
# Phrase-based score
$ python conlleval.py results.txt
(The python version of conlleval doesn’t calculate the token-based score, but if you really want it, you can use the original perl version. You would use the
Here are some other NER frameworks which you are welcome to run in the unconstrained version:
Note: you are not allowed to use pre-trained NER models even in the unconstrained version. Please train your own. You are allowed to use pre-trained embeddings.
The version we have given you gets about 49% F1 right out of the box. We made some very simple modifications, and got it to 60%. This is a generous baseline that any thoughtful model should be able to beat. The state of the art on the Spanish dataset is about 85%. If you manage to beat that, then look for conference deadlines and start writing, because you can publish it.
As always, beating the baseline alone will earn you a B on the project. In order to earn an A, demonstrate that you have thought about the problem carefully and come up with solutions beyond what was strictly required. Extra credit goes to the top of the leaderboard, etc.
Here are the deliverables that you will need to submit: