# Perspective Relevance and Stance Classification
## CIS 530 Homework 11 - Spring 2020
Arguments play an important role in understanding controversial topics. For instance, watching debates over a controversial topic is arguably the most efficient way of learning about different perspectives on the matter. However, in real life, information around a topic (e.g. from news publishers) is usually organized in a limited and repetitive way, so readers rarely get to see a variety of perspectives from diverse backgrounds.

With the goal of "showing diverse perspectives with respect to a controversial topic", one of your TAs built an argument search engine called [PerspectroScope](https://perspectroscope.seas.upenn.edu/). Given a controversial claim as input, the search engine looks for potential arguments on the open web and uses classifiers trained on a dataset called [Perspectrum](https://cogcomp.seas.upenn.edu/perspectrum/) to decide whether each potential argument is indeed relevant and whether it supports or refutes the claim.

In this homework, we will be using BERT, a powerful and popular contextual neural language model, to tackle the two sentence-pair classification tasks that make up the "PerspectroScope" argument search engine.
1. Given a claim and a sentence, classify whether the sentence presents a **relevant perspective** to the claim.
2. Given a claim and a sentence expressing a relevant perspective, classify whether the perspective **supports or refutes** the claim.




## **Part I:** Relevance Classification with BERT fine-tuning

#### But first...What is fine-tuning? 
Fine-tuning takes a machine learning model that has already been trained for a given task/objective and trains it further on a second, similar task/objective.

#### Why do we need to fine-tune BERT?
BERT is trained with the Masked Language Modeling objective -- a similar but slightly more sophisticated learning objective to the neural language models you have previously seen in class. Here is a nice [demo](https://demo.allennlp.org/masked-lm?text=The%20doctor%20ran%20to%20the%20emergency%20room%20to%20see%20%5BMASK%5D%20patient.) that demonstrates how it works. Your TA also built [a more powerful but less fancy version](http://dickens.seas.upenn.edu:4001/%20%20/I%20love%20%40%20chocolate/perToken). In case you are interested in seeing how BERT works, please try these two demos out. 
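
To make the Masked Language Modeling objective a bit more concrete, here is a minimal sketch of how training inputs could be built: some tokens are hidden behind `[MASK]`, and the model is asked to recover the originals. This is a simplification; the function name `mask_tokens` and its parameters are made up for illustration, and the procedure in the BERT paper is more involved (it selects about 15% of the tokens and only replaces 80% of those with `[MASK]`, swapping 10% for a random token and leaving 10% unchanged).

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Hide a random subset of tokens behind [MASK].

    Returns the masked sequence and a dict that maps each masked
    position to the original token the model would have to predict.
    """
    rng = random.Random(seed)  # local generator, so the example is repeatable
    masked = list(tokens)
    targets = {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked[i] = MASK_TOKEN
            targets[i] = tok
    return masked, targets

sentence = "the doctor ran to the emergency room to see her patient".split()
masked_sentence, targets = mask_tokens(sentence, mask_prob=0.3, seed=1)
print(masked_sentence)  # model input
print(targets)          # positions and tokens the model is trained to predict
```

The training loss is computed only at the masked positions, which is why the targets are stored separately from the input.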

The takeaway is that the Masked LM objective gives BERT the capability of general "language understanding". However, without fine-tuning, BERT is not equipped with the "domain knowledge" of the specific tasks you are interested in solving. So you need to initialize your model with the pretrained BERT weights and further train it with labeled data specific to the task.

So far, everything looks very similar to the Neural LM homework, where you initialized the model with pretrained word embeddings (e.g. GloVe, Skip-gram). But here is one important difference, which is also one of the reasons why BERT is so powerful: when you fine-tune BERT, you update not only the last layers but the weights of the entire BERT model. In contrast, when you use word embeddings, you don't actually update the part of the network that was used to train them. This makes BERT more expressive and easier to adapt to the task-specific supervision provided during fine-tuning.

We will be using the [transformers](https://github.com/huggingface/transformers) package developed by Huggingface, which is built on PyTorch. It is the most popular library for BERT and other transformer-based language models like GPT-2.


**IMPORTANT: Make sure that you have GPU set as your Hardware Accelerator in Runtime > Change runtime type before running this Colab.**

### Installing the Huggingface 🤗 transformers package + other required packages

In [0]:
import os

!git clone https://github.com/huggingface/transformers
os.chdir('/content/transformers')
!pip install .
!pip install -r ./examples/requirements.txt
!pip install tqdm

### Import the important packages that we need

In [0]:
import torch 
import numpy as np

### Mount your google drive 

We will be saving trained checkpoints on your Google Drive so that they can be accessed even if the Colab session dies. Make sure to login with your UPenn credentials, as you will be saving several gigabytes of data, and Penn gives you unlimited Drive storage.

In [0]:
from google.colab import drive
drive.mount('/content/drive')

### Download the PERSPECTRUM dataset
Note that with the default code, the files are not saved to your Google Drive, which means they will be deleted when the session closes. You can either re-run this cell in each new Colab session, or save the files to the mounted drive at `/content/drive`.

In [0]:
dataset_dir = '/content/'

# Perspectrum Training Set
!wget -nc -P {dataset_dir} https://raw.githubusercontent.com/computational-linguistics-class/computational-linguistics-class.github.io/master/homework/perspectives/perspectrum_train.json

# Perspectrum Relevance - Dev/Test set
!wget -nc -P {dataset_dir} https://raw.githubusercontent.com/computational-linguistics-class/computational-linguistics-class.github.io/master/homework/perspectives/perspectrum_relevance_dev.json
!wget -nc -P {dataset_dir} https://raw.githubusercontent.com/computational-linguistics-class/computational-linguistics-class.github.io/master/homework/perspectives/perspectrum_relevance_test_no_label.json

# Perspectrum Stance - Dev/Test set
!wget -nc -P {dataset_dir} https://raw.githubusercontent.com/computational-linguistics-class/computational-linguistics-class.github.io/master/homework/perspectives/perspectrum_stance_dev.json
!wget -nc -P {dataset_dir} https://raw.githubusercontent.com/computational-linguistics-class/computational-linguistics-class.github.io/master/homework/perspectives/perspectrum_stance_test_no_label.json

### Load the dataset and see what it looks like

For now let's first load the training dataset and see what it looks like. We will worry about the dev/test sets later...

**Note**: In practice you should NEVER look at the training data you are working with, when there's a dev set available. This is to prevent one from designing models that leverage the "dataset artifacts" of the training data.  However in this case, we've done negative sampling for you on dev/test set, so they are already in a different format. 

In [0]:
import json
import os

with open(os.path.join(dataset_dir, 'perspectrum_train.json')) as fin:
    train_set = json.load(fin)

print(type(train_set))
print("Number of claims in training set: {}".format(len(train_set)))
print("Here's what one of the examples looks like: {}".format(json.dumps(train_set[100])))

So the training set contains a list of claims. Under each claim, there are a few relevant "perspectives" that either support or refute the claim; these will serve as **positive examples** for training. Note that **we don't have negative examples provided**. This will be the case for most datasets you meet in real life.

Now let's randomly sample negative perspectives (i.e. perspectives NOT related to the given claim). 

In [0]:
import random

def negative_sample(train_set, claim_id, claim_text, sample_size):
    """
    Given a claim (identified by claim_id), randomly sample {sample_size} negative examples from the dataset,
    i.e. perspectives that belong to a different claim. claim_text is accepted but not used.
    """
    # Each perspective object in the list should be a dictionary with two keys "id", "text".
    other_examples = [ex for ex in train_set if ex["cid"] != claim_id]
    
    negative_examples = []
    for i in range(sample_size):
        rand_claim = random.choice(other_examples)
        all_persps = rand_claim["perspective_for"] + rand_claim["perspective_against"]
        random_persp = random.choice(all_persps)
        negative_examples.append(random_persp)
    
    return negative_examples

training_sentence_pairs = []

for claim in train_set:
    positive_perspectives = claim["perspective_for"] + claim["perspective_against"]
    
    # We keep the number of negative examples equal to positive, so that we will have a balanced training set
    negative_perspectives = negative_sample(train_set, claim['cid'], claim['claim_text'], len(positive_perspectives)) 
    
    for persp in positive_perspectives:
        training_sentence_pairs.append({
            "claim_id": claim["cid"],
            "claim_text": claim["claim_text"],
            "perspective_id": persp["id"],
            "perspective_text": persp["text"],
            "label": True
        })

    for persp in negative_perspectives:
        training_sentence_pairs.append({
            "claim_id": claim["cid"],
            "claim_text": claim["claim_text"],
            "perspective_id": persp["id"],
            "perspective_text": persp["text"],
            "label": False
        })

print("Number of claim-perspective sentence pairs for training: {}".format(len(training_sentence_pairs)))

Now it would be a good time to load our dev/test examples, which are already organized in the same sentence pair format as what you just did.

In [0]:
with open(os.path.join(dataset_dir, 'perspectrum_relevance_dev.json')) as fin:
    dev_sentence_pairs = json.load(fin)

with open(os.path.join(dataset_dir, 'perspectrum_relevance_test_no_label.json')) as fin:
    test_sentence_pairs = json.load(fin)

print("Number of claim-perspective sentence pairs in dev set: {}".format(len(dev_sentence_pairs)))
print("Number of claim-perspective sentence pairs in test set: {}".format(len(test_sentence_pairs)))

### Load Pretrained BERT Model
For the sake of running time and memory limit, we will be using a mini version of BERT, which consists of 4 transformer layers (as opposed to 12 in the base version of BERT). For the leaderboard, feel free to use a larger size BERT to achieve better performance. 

You can search for the available models [here](https://huggingface.co/models?search=bert).

You can find more examples of different use cases for BERT in the transformer github repo README -- https://github.com/huggingface/transformers


In [0]:
from transformers import InputExample
from transformers import (WEIGHTS_NAME, BertConfig,
                          BertForSequenceClassification, BertTokenizer)
from transformers import glue_convert_examples_to_features as convert_examples_to_features
from transformers.optimization import AdamW, get_linear_schedule_with_warmup
import tqdm

from torch.utils.data import (DataLoader, RandomSampler, SequentialSampler,
                              TensorDataset)

bert_model_type = 'google/bert_uncased_L-4_H-256_A-4'   # Specs of BERT models with different sizes can be found at https://github.com/google-research/bert/
                                                        # You can experiment models with different sizes, to see how it affects performance. 

bert_model = BertForSequenceClassification.from_pretrained(bert_model_type)
config = BertConfig.from_pretrained(bert_model_type)
tokenizer = BertTokenizer.from_pretrained(bert_model_type)

### Convert examples to BERT input features
Much like with every other neural network, you need to (1) tokenize your input sentences and (2) have a vocabulary/dictionary that converts each token to a vector/tensor. Luckily, BERT offers a very nice set of interfaces through which you can do these steps easily.

In this homework we provide this function to you. However, in case you would like to use BERT in the future, it is really important to understand BERT's input format and the word-piece tokenization strategy that BERT adopts. Here are a few resources that we suggest -- 

1. The ["What is BERT" section](https://github.com/google-research/bert#what-is-bert) in the official BERT code repo by Google
2. Section 3 of the [BERT paper](https://arxiv.org/pdf/1810.04805.pdf)
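
To make the input format described in those resources concrete, here is a minimal, library-free sketch of two ideas: greedy longest-match word-piece splitting, and packing a claim/perspective pair into `[CLS] A [SEP] B [SEP]` with segment ids and an attention mask. The toy vocabulary and the function names `wordpiece_tokenize` and `pack_sentence_pair` are made up for illustration. The real tokenizer does more (lowercasing, punctuation splitting, truncation of long inputs, mapping tokens to integer ids), so treat this as a sketch of the idea rather than the library's implementation.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word into word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # pieces that continue a word get a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # nothing matched, so the whole word is treated as unknown
        pieces.append(piece)
        start = end
    return pieces

def pack_sentence_pair(tokens_a, tokens_b, max_length):
    """Build tokens, segment ids and attention mask for one sentence pair."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    if len(tokens) > max_length:
        raise ValueError("pair is longer than max_length; this sketch does not truncate")
    # Segment 0 covers [CLS] + sentence A + first [SEP]; segment 1 covers sentence B + last [SEP].
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    attention_mask = [1] * len(tokens)  # 1 = real token, 0 = padding
    pad = max_length - len(tokens)
    return (tokens + ["[PAD]"] * pad,
            token_type_ids + [0] * pad,
            attention_mask + [0] * pad)

# Hypothetical toy vocabulary, just large enough for this example.
toy_vocab = {"ban", "##ned", "##ning", "guns", "should", "be", "saves", "lives"}
claim = ["guns", "should", "be"] + wordpiece_tokenize("banned", toy_vocab)
perspective = wordpiece_tokenize("banning", toy_vocab) + ["guns", "saves", "lives"]

tokens, token_type_ids, attention_mask = pack_sentence_pair(claim, perspective, max_length=16)
print(tokens)
print(token_type_ids)
print(attention_mask)
```

The three lists printed at the end play the same roles as the `input_ids`, `token_type_ids` and `attention_mask` tensors built in the provided function, except that the tokens have not been converted to integer ids.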


In [0]:
relevance_label_mapping = {
    True: 1,
    False: 0
} # If you are working on stance classification, create a different label mapping

def convert_sentence_pair_to_tensor_input(sentence_pairs, label_mapping):

    # STEP 1: wrap each sentence pair in an InputExample (claim as text_a, perspective as text_b)
    input_examples = []
    for pair in sentence_pairs:
        current_label = pair["label"] if "label" in pair else False
        input_examples.append(
            InputExample(guid="", # We don't really need this
                         text_a=pair["claim_text"], 
                         text_b=pair["perspective_text"], 
                         label=label_mapping[current_label])
        )

    label_list = [val for _, val in label_mapping.items()]

    features = convert_examples_to_features(input_examples,
                                                   tokenizer,
                                                   label_list=label_list,
                                                   max_length=128,  
                                                   output_mode="classification")
    
    input_ids = torch.tensor([f.input_ids for f in features], dtype=torch.long)
    attention_mask = torch.tensor([f.attention_mask for f in features], dtype=torch.long)
    token_type_ids = torch.tensor([f.token_type_ids for f in features], dtype=torch.long)
    labels = torch.tensor([f.label for f in features], dtype=torch.long)

    dataset = TensorDataset(input_ids, attention_mask, token_type_ids, labels)

    return dataset

In [0]:
train_dataset = convert_sentence_pair_to_tensor_input(training_sentence_pairs, relevance_label_mapping)

### Choose your hyperparameters + model output directory
Before we get into training, we need to set our hyperparameters, e.g. learning rate, mini-batch sizes for training/testing, etc.
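
Two of these hyperparameters, the learning rate and the number of warmup steps, interact through the learning-rate schedule. As a rough sketch (assuming a linear warmup followed by a linear decay to zero, which is the shape of schedule used in the training cell), the multiplier applied to the base learning rate at each optimizer step can be written as below. The function name `linear_warmup_decay` and the step counts are made up for illustration and are not part of the library.

```python
def linear_warmup_decay(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Warmup: ramp up linearly from 0 to 1.
        return step / max(1, num_warmup_steps)
    # Decay: go down linearly from 1 to 0 over the remaining steps.
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

base_lr = 3e-5
for step in (0, 50, 100, 550, 1000):
    multiplier = linear_warmup_decay(step, num_warmup_steps=100, num_training_steps=1000)
    print("step {:4d}: lr = {:.2e}".format(step, base_lr * multiplier))
```

Note that `num_training_steps` here is the total number of optimizer steps the schedule spans; once that many steps have been taken, the multiplier stays at zero and further updates no longer change the weights.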

In [0]:
HYPER_PARAMS = {
    "num_training_epoch": 3,
    "learning_rate": 3e-5,        # Suggested values -- [1e-5, 3e-5, 5e-5]
    "training_batch_size": 16,    # Suggested values -- [16, 32]
    "eval_batch_size": 8,
    "max_grad_norm": 1.0,
    "num_warmup_steps": 0.1
}

model_output_dir = "/content/drive/" # Model + prediction results will be saved to your GDrive, 
                                     # so you don't lose them after session closes

### Fine-tune BERT model

Remember NOT to re-run this cell multiple times, without re-initializing the BERT model. Multiple runs will effectively train your model with more epochs than you intended!

In [0]:
import tqdm.notebook  # import the submodule explicitly so that tqdm.notebook.tqdm is available

bert_model.to('cuda')

train_sampler = RandomSampler(train_dataset)
train_dataloader = DataLoader(train_dataset, 
                              sampler=train_sampler, 
                              batch_size=HYPER_PARAMS["training_batch_size"])

optimizer = AdamW(bert_model.parameters(), 
                  lr=HYPER_PARAMS['learning_rate'], 
                  correct_bias=False)

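# The schedule has to span every optimizer step across all epochs;
# otherwise the learning rate decays to zero at the end of the first epoch.
num_training_steps = len(train_dataloader) * HYPER_PARAMS["num_training_epoch"]

scheduler = get_linear_schedule_with_warmup(optimizer, 
                                            num_warmup_steps=HYPER_PARAMS['num_warmup_steps'], 
                                            num_training_steps=num_training_steps)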


global_step = 0
tr_loss = 0.0
bert_model.zero_grad()
bert_model.train()

for epc in range(HYPER_PARAMS["num_training_epoch"]):
    print("Epoch #{}: \n".format(epc))
    epoch_iterator = tqdm.notebook.tqdm(train_dataloader, desc="Training Steps")
    avg_loss_over_epoch = []
    for step, batch in enumerate(epoch_iterator):
        batch = tuple(t.to("cuda") for t in batch)
        inputs = {'input_ids': batch[0],
                  'attention_mask': batch[1],
                  'token_type_ids': batch[2],
                  'labels': batch[3]}

        outputs = bert_model(**inputs)
        loss = outputs[0]  # model outputs are always tuple in transformers (see doc)

        loss.backward()
        torch.nn.utils.clip_grad_norm_(bert_model.parameters(), HYPER_PARAMS["max_grad_norm"])
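        tr_loss += loss.item()
        avg_loss_over_epoch.append(loss.item())

        optimizer.step()
        scheduler.step()
        bert_model.zero_grad()

    # Report the mean training loss so you can check that it goes down from epoch to epoch
    print("Average training loss over epoch #{}: {:.4f}".format(epc, np.mean(avg_loss_over_epoch)))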

### Save the fine-tuned model
It is good practice to save your tokenizer + config for BERT at the same location, for best reproducibility

In [0]:
import os

# This is where we mounted your google drive. 
# You might need to re-mount it if your session was closed half way through
output_dir = "/content/drive/My Drive/cis530_perspective_hw/relevance_model/" 

if not os.path.exists(output_dir):
    os.makedirs(output_dir)

bert_model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)
config.save_pretrained(output_dir)

### Test if you can load the model back!

In [0]:
bert_model = BertForSequenceClassification.from_pretrained(output_dir)
tokenizer = BertTokenizer.from_pretrained(output_dir)

# Don't forget to move your model to GPU/CUDA after loading back from disk!
bert_model = bert_model.to("cuda")

### Evaluate the fine-tuned model on dev set
Now we want to know how good our model is. Let's test it on the dev set!

We need to go through the same process -- convert sentence pairs into feature vectors/tensors

In [0]:
# Putting this here again, just so you don't forget what it is...
relevance_label_mapping = {
    True: 1,
    False: 0
} 

dev_dataset = convert_sentence_pair_to_tensor_input(dev_sentence_pairs, relevance_label_mapping)

# We are not random sampling anymore when evaluating... As we want to keep the order 
dev_sampler = SequentialSampler(dev_dataset)
dev_dataloader = DataLoader(dev_dataset, 
                            sampler=dev_sampler, 
                            batch_size=HYPER_PARAMS["eval_batch_size"])

predictions = None
out_label_ids = None

for batch in tqdm.notebook.tqdm(dev_dataloader, desc="Evaluating on Dev set..."):
    bert_model.eval()
    batch = tuple(t.to("cuda") for t in batch)
    inputs = {'input_ids': batch[0],
              'attention_mask': batch[1],
              'token_type_ids': batch[2],
              'labels': batch[3]}

    with torch.no_grad():
        outputs = bert_model(**inputs)
        logits = outputs[1] # Shape (batch_size, 2): one score per label for each example in the batch

    if predictions is None:
        predictions = logits.detach().cpu().numpy()
        out_label_ids = inputs['labels'].detach().cpu().numpy()
    else:
        predictions = np.append(predictions, logits.detach().cpu().numpy(), axis=0)
        out_label_ids = np.append(out_label_ids, inputs['labels'].detach().cpu().numpy(), axis=0)

# whichever label gets higher score, we will predict that label
predictions = np.argmax(predictions, axis=1)


# We will simply use accuracy as our measure here 
def accuracy(preds, labels):
    return (preds == labels).mean()

acc = accuracy(predictions, out_label_ids)

print("The accuracy on dev set = {}".format(acc))

The TAs were able to get around 75-80% accuracy on the dev set, with the provided set of parameters and model. 
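
The prediction step in the evaluation cell picks, for every sentence pair, the label with the higher score (logit). As a small, library-free sketch of what that means for a single pair, the snippet below turns two made-up logits into probabilities with a softmax and takes the argmax. The helper names `softmax` and `predict_label` and the example scores are purely illustrative.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def predict_label(logits):
    """Index of the highest score; index 1 = relevant, 0 = not relevant."""
    return max(range(len(logits)), key=lambda i: logits[i])

example_logits = [0.2, 1.5]  # made-up scores for labels 0 and 1
print(softmax(example_logits))
print(predict_label(example_logits))
```

Because softmax preserves the ordering of the scores, taking the argmax of the raw logits gives the same prediction as taking the argmax of the probabilities, which is why the evaluation code can skip the softmax.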

### Now it's your turn - Evaluate on the test data, and submit your results

**Important Note**: The labels of the test data are NOT given to you in this homework. However, the helper functions will still generate a dummy label for each input sentence pair. The only way to measure accuracy on the test set is to submit your test results `relevance_test_predictions.txt` to Gradescope. 

Other than that, this should be almost identical to what we just did for the dev set.

Please download `relevance_test_predictions.txt` and follow the guide on the homework webpage to make a submission.
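
Before submitting, it can be worth sanity-checking the prediction file. The sketch below assumes the format produced by the writing code in this notebook, i.e. one `0` or `1` per line and one line per test pair; the autograder's exact checks are not described here, and the helper name `check_prediction_file` is made up for illustration.

```python
import os
import tempfile

def check_prediction_file(path, expected_count):
    """Return True if the file has exactly expected_count lines, each "0" or "1"."""
    with open(path) as fin:
        lines = [line.strip() for line in fin if line.strip()]
    if len(lines) != expected_count:
        return False
    return all(line in ("0", "1") for line in lines)

# Small demonstration on a temporary file with three made-up predictions.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "relevance_test_predictions.txt")
    with open(demo_path, "w") as fout:
        for pred in [1, 0, 1]:
            fout.write("{}\n".format(int(pred)))
    print(check_prediction_file(demo_path, expected_count=3))
```

In the notebook you would call it on your own output path with `expected_count=len(test_sentence_pairs)`.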

In [0]:
def predict_on_test_set():
    """
    Return a list of 0/1 prediction for each test example, in sequential order.
    Please use the same label mapping as we have so far.
    1 = True (Relevant)
    0 = False (Not relevant)
    """

    test_dataset = convert_sentence_pair_to_tensor_input(test_sentence_pairs, relevance_label_mapping)
    
    # TODO: fill the rest here
    
    return list_of_predictions


# Feel free to change the save location as you like,
# but please keep the file name as "relevance_test_predictions.txt"
# So that the autograder will know what file to look for...
test_result_output_path = "/content/drive/My Drive/cis530_perspective_hw/relevance_test_predictions.txt"

test_predictions = predict_on_test_set()

with open(test_result_output_path, 'w') as fout:
    for pred in test_predictions: 
        fout.write("{}\n".format(int(pred)))

## **Part II:** DIY for stance classification (Optional, Extra Credit)

Now that you are becoming a BERT expert (hopefully), why don't you try to tackle our second task -- stance classification, which is to predict whether a relevant perspective is **supporting or refuting** the claim.

Since this is a different task, you will be generating positive and negative sentence pairs in a slightly different way. Specifically --

1.   In `perspectrum_train.json`, for each given claim, both supporting and refuting perspectives have been given to you. So you don't need to do negative sampling. Instead you should take the claim + "supporting" perspective as positive sentence pair and claim with "refuting" perspective as negative pair.   

2.   The task assumes that for every input claim-perspective pair, the perspective is relevant to the claim. So when generating training pairs, you should make sure of that.

But once you have generated sentence pairs from the training data, the training/evaluation procedure should be almost identical. For the most part you will be re-using code that we just went through.

### **What you need to submit**:
As with perspective relevance classification, we want you to train a model and write your stance classification predictions on the test data to a file named `stance_test_predictions.txt`. 



In [0]:
with open(os.path.join(dataset_dir, 'perspectrum_train.json')) as fin:
    train_set = json.load(fin)

# TODO: start from here

In [0]:
# The dev and test sets are, again, made into sentence pairs format for you already
with open(os.path.join(dataset_dir, 'perspectrum_stance_dev.json')) as fin:
    dev_sentence_pairs = json.load(fin)

with open(os.path.join(dataset_dir, 'perspectrum_stance_test_no_label.json')) as fin:
    test_sentence_pairs = json.load(fin)

print("Number of claim-perspective sentence pairs in dev set: {}".format(len(dev_sentence_pairs)))
print("Number of claim-perspective sentence pairs in test set: {}".format(len(test_sentence_pairs)))

stance_label_mapping = {
    "support": 1,
    "refute": 0
} 

# TODO: start from here