Since you have read Jurafsky and Martin chapter 9, you know that Named Entity Recognition is the task of finding and classifying named entities in text. This task is often considered a sequence tagging task, like part of speech tagging, where words form a sequence through time, and each word is given a tag. Unlike part of speech tagging however, NER usually uses a relatively small number of tags, where the vast majority of words are tagged with the ‘non-entity’ tag, or O tag.
Your task is to implement your own named entity recognizer. Relax, you’ll find it’s a lot easier than it sounds, and it should be very satisfying to accomplish this. You will implement an entity tagger using scikit-learn, filling out the stub that we give you. There will be a leaderboard.
As with nearly all NLP tasks, you will find that the two big points of variability in NER are (a) the features, and (b) the learning algorithm, with the features arguably being the more important of the two. The point of this assignment is for you to think about and experiment with both of these. Are there interesting features you can use? What latent signal might be important for NER? What have you learned in the class so far that can be brought to bear?
Get a head start on common NER features by looking at Figure 21.5 in the textbook.
Here are the materials that you should download for this assignment:
The data we use comes from the Conference on Natural Language Learning (CoNLL) 2002 shared task on named entity recognition for Spanish and Dutch. The introductory paper to the shared task will be of immense help to you, and you should definitely read it. You may also find the original shared task page helpful. We will use the Spanish corpus (although you are welcome to try out Dutch too).
The tagset is: PER (person), LOC (location), ORG (organization), and MISC (miscellaneous).
The data uses BIO encoding (called IOB in the textbook), which means that each named entity tag is prefixed with a B-, which means beginning, or an I-, which means inside. So, for a multiword entity, like “James Earle Jones”, the first token “James” would be tagged with “B-PER”, and each subsequent token is “I-PER”. The O tag is for non-entities.
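To make the encoding concrete, here is a minimal sketch that turns entity spans into BIO tags. The function name spans_to_bio and the example sentence are made up for illustration; they are not part of the assignment stub.

```python
def spans_to_bio(n_tokens, spans):
    """Convert (start, end, label) spans (end exclusive) into a list of BIO tags."""
    tags = ["O"] * n_tokens              # every token starts as a non-entity
    for start, end, label in spans:
        tags[start] = "B-" + label       # first token of the entity
        for i in range(start + 1, end):
            tags[i] = "I-" + label       # remaining tokens of the same entity
    return tags


# Illustrative sentence (not taken from the corpus).
tokens = ["James", "Earle", "Jones", "visited", "La", "Coruña", "."]
tags = spans_to_bio(len(tokens), [(0, 3, "PER"), (4, 6, "LOC")])
for token, tag in zip(tokens, tags):
    print(token, tag)
```

Note that a B- tag is what lets two adjacent entities of the same type stay separate, which a plain inside/outside scheme could not express.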
We strongly recommend that you study the training and dev data (no one’s going to stop you from examining the test data, but for the integrity of your model, it’s best to not look at it). Are there idiosyncrasies in the data? Are there patterns you can exploit with cool features? Are there obvious signals that identify names? For example, in some Turkish writing, there is a tradition of putting an apostrophe between a named entity and the morphology attached to it. A feature like isApostrophePresent() goes a long way. Of course, in English and several other languages, capitalization is a hugely important feature. In some African languages, there are certain words that always precede city names.
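As a rough sketch of what such features might look like in code, the function below builds a feature dictionary for one token. The name word_features and the particular feature set are assumptions for illustration; the stub you are given may organize features differently.

```python
def word_features(sent, i):
    """Return a feature dict for token i of sent (a list of word strings)."""
    word = sent[i]
    feats = {
        "bias": 1.0,
        "lower": word.lower(),
        "is_title": word.istitle(),                      # e.g. "Coruña"
        "is_upper": word.isupper(),                      # e.g. "EFECOM"
        "has_digit": any(ch.isdigit() for ch in word),
        "has_apostrophe": "'" in word or "\u2019" in word,
        "suffix3": word[-3:],
    }
    # Context features: neighbouring words often signal an entity.
    if i == 0:
        feats["BOS"] = True                              # beginning of sentence
    else:
        feats["prev_lower"] = sent[i - 1].lower()
    if i == len(sent) - 1:
        feats["EOS"] = True                              # end of sentence
    else:
        feats["next_lower"] = sent[i + 1].lower()
    return feats


sent = ["La", "Coruña", ",", "23", "may", "(", "EFECOM", ")", "."]
for i in range(len(sent)):
    print(sent[i], word_features(sent, i))
```

Feature dictionaries of this shape can be turned into a matrix with scikit-learn's DictVectorizer before being passed to a classifier.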
You will be glad to hear that the data is a mercifully small download. See the NLTK data page for download options, but one way to get the conll2002 data is:
$ python -m nltk.downloader conll2002
There are two common ways of evaluating NER systems: phrase-based, and token-based. In phrase-based, the more common of the two, a system must predict the entire span correctly for each name. For example, say we have text containing “James Earle Jones”, and our system predicts “[PER James Earle] Jones”. Phrase-based gives no credit for this because it missed “Jones”, whereas token-based would give partial credit for correctly identifying “James” and “Earle” as B-PER and I-PER respectively. We will use phrase-based to report scores.
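The difference between the two schemes can be illustrated on the “James Earle Jones” example. This is only a sketch for a single sequence, with helper names of our own choosing; the token-level scoring below is one plausible definition and may not match the Perl conlleval exactly. Use conlleval.py for the scores you report.

```python
def bio_to_spans(tags):
    """Return the set of (start, end, label) spans (end exclusive) in a BIO sequence."""
    spans, start, label = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):   # sentinel "O" closes a final span
        # An open span ends on O, on any B- tag, or on an I- tag with another label.
        if start is not None and (tag == "O" or tag.startswith("B-") or tag[2:] != label):
            spans.add((start, i, label))
            start, label = None, None
        if tag != "O" and start is None:
            start, label = i, tag[2:]
    return spans


def prf(n_correct, n_pred, n_gold):
    """Precision, recall and F1 from raw counts."""
    p = n_correct / n_pred if n_pred else 0.0
    r = n_correct / n_gold if n_gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def phrase_scores(gold, pred):
    """A predicted entity counts only if its whole span and label match."""
    g, p = bio_to_spans(gold), bio_to_spans(pred)
    return prf(len(g & p), len(p), len(g))


def token_scores(gold, pred):
    """Each non-O token is scored on its own."""
    correct = sum(1 for g, p in zip(gold, pred) if g == p and g != "O")
    n_pred = sum(1 for p in pred if p != "O")
    n_gold = sum(1 for g in gold if g != "O")
    return prf(correct, n_pred, n_gold)


gold = ["B-PER", "I-PER", "I-PER"]   # [PER James Earle Jones]
pred = ["B-PER", "I-PER", "O"]       # [PER James Earle] Jones
print("phrase-based P/R/F1:", phrase_scores(gold, pred))
print("token-based  P/R/F1:", token_scores(gold, pred))
```

The phrase-based score for this prediction is zero because the predicted span stops one token short, while the token-based score still rewards the two correctly tagged tokens.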
The output of your code must be word gold pred, as in:
La B-LOC B-LOC
Coruña I-LOC I-LOC
, O O
23 O O
may O O
( O O
EFECOM B-ORG B-ORG
) O O
. O O
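One way to produce a file in this format is sketched below. The function name write_results is hypothetical, and two details are assumptions rather than requirements stated here: fields are separated by a single space, and a blank line is written after each sentence so that a scorer can see sentence boundaries. Each token is assumed to be a tuple whose first item is the word and whose last item is the gold tag, which to our understanding is the shape of the tuples returned by conll2002.iob_sents.

```python
import os
import tempfile


def write_results(path, sents, predictions):
    """Write one 'word gold pred' line per token, with a blank line after each sentence.

    sents: list of sentences, each a list of tuples (word, ..., gold_tag).
    predictions: list of lists of predicted tags, parallel to sents.
    """
    with open(path, "w", encoding="utf-8") as out:
        for sent, preds in zip(sents, predictions):
            if len(sent) != len(preds):
                raise ValueError("sentence and prediction lengths differ")
            for token, pred in zip(sent, preds):
                word, gold = token[0], token[-1]
                out.write(f"{word} {gold} {pred}\n")
            out.write("\n")   # assumed sentence separator


# Tiny demo with placeholder POS tags ("POS"); real tuples carry real POS tags.
demo_sents = [[("La", "POS", "B-LOC"), ("Coruña", "POS", "I-LOC"), (",", "POS", "O")]]
demo_preds = [["B-LOC", "I-LOC", "O"]]
demo_path = os.path.join(tempfile.mkdtemp(), "results.txt")
write_results(demo_path, demo_sents, demo_preds)
print(open(demo_path, encoding="utf-8").read())
```

If your model predicts one flat list of tags for the whole dataset, split it back into per-sentence lists before calling a writer like this.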
Here’s how to get scores (assuming the above format is in a file called results.txt):
# Phrase-based score
$ python conlleval.py results.txt
Please create this output for the training set (as train_results.txt), the development set (as dev_results.txt), and the test set (as test_results.txt). You can retrieve the sentences with the following code:
from nltk.corpus import conll2002
train_sents = list(conll2002.iob_sents('esp.train'))
dev_sents = list(conll2002.iob_sents('esp.testa'))
test_sents = list(conll2002.iob_sents('esp.testb'))
(The Python version of conlleval doesn’t calculate the token-based score, but if you really want it, you can use the original Perl version. You would use the
The version we have given you gets about 49% F1 right out of the box. We made some very simple modifications, and got it to 60%. This is a generous baseline that any thoughtful model should be able to beat. The state of the art on the Spanish dataset is about 85%. If you manage to beat that, then look for conference deadlines and start writing, because you can publish it.
In order to earn an A, demonstrate that you have thought about the problem carefully, and come up with solutions beyond what was strictly required.
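import pickle
from sklearn.linear_model import LogisticRegression
# X_train and Y_train are your feature matrix and tag labels
model = LogisticRegression()
model.fit(X_train, Y_train)
# save the trained model, then load it back
filename = 'model'
with open(filename, 'wb') as f:
    pickle.dump(model, f)
with open(filename, 'rb') as f:
    loaded_model = pickle.load(f)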
Here are the deliverables that you will need to submit: