The costs of mental and neurological health problems in the United States are staggering, estimated at over $760 billion per year. One in five people in the U.S. experiences a mental health problem in a given year, and mental health and substance abuse disorders are among the leading causes of disability worldwide [1].
Language plays an important role in diagnosing many mental and neurological health problems. This is where we come in. Our aim is to find out whether we can reliably identify people with mental health issues (clinical depression and Post-Traumatic Stress Disorder (PTSD)) based on their Twitter activity. If so, this technology could lead to inexpensive screening measures that healthcare professionals could employ.
Here are the materials that you should download for this assignment:
Important Note about Data
This homework assignment involves sensitive mental health data. Therefore, in order to obtain the dataset, you will need to complete CITI Human Subjects Training (~30 mins) and sign a release form. To do so, please follow these steps:
If you’re working in pairs, both partners need to follow the above steps.
The instructor will then send you the data.
Our goal is this: given a set of a person’s tweets (and their associated metadata), predict whether that person suffers from depression, PTSD, or neither of these (control). Specifically, we will create three models, one for each pairwise distinction:

PTSD vs. control
depression vs. control
PTSD vs. depression

Each model takes as input a pair of conditions to distinguish between (we’ll refer to them as conditionPOS and conditionNEG), and a list of people actually having either conditionPOS or conditionNEG. The model should re-rank the list of people such that those most likely to have conditionPOS are at the top, and those most likely to have conditionNEG are at the bottom.
The evaluation metric we will use is called average precision:
\[AP = \frac{1}{|R|} \cdot \sum_{i=1}^n prec(i) \cdot relevance(i)\]
where \(R\) is the set of relevant people (i.e. people belonging to conditionPOS), \(n\) is the total number of people being ranked, \(prec(i)\) gives the precision at the \(i\)-th person, and \(relevance(i)\) is an indicator variable denoting whether the person ranked at position \(i\) belongs to conditionPOS.

For example, imagine there are 5 ranked people with conditions as follows:

conditionPOS
conditionNEG
conditionPOS
conditionNEG
conditionPOS

i.e. the conditionPOS people have been ranked at positions 1, 3, and 5. The average precision, then, would be:

\[AP = \frac{1}{3}\left(\frac{1}{1} + \frac{2}{3} + \frac{3}{5}\right) \approx 0.7556\]
The dataset consists of a meta file and train and test directories. The meta file is a .csv with information about each user, including an (anonymized) Twitter handle, age, gender, tweet count, and experimental condition (ptsd, depression, or control). Each user also has an assigned chunk index, which refers to the location of that user’s data in the train or test directory.
The tweets themselves are contained in the train and test directories. Each directory contains numbered chunks of tweets, with each chunk having tweets from a list of users. Within the test directory, we have partitioned the test data into a dev and test set, with the dev set consisting of chunks 60-65 and test data in chunks 66-89.
We have provided a simple interface for you to access the data in util.py:
The above function returns a list containing one tuple per user, consisting of the user’s anonymized handle, metadata, and a bytestring of tweets in JSON format. Here’s an example for a user with just a single tweet:
You can access a particular element of tweets for a given user using the helper function util.get_tweets_element. For example, to extract the text of each tweet from the same user, run:
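(A rough sketch only: the loader name below is hypothetical, and the exact signature of util.get_tweets_element may differ, so check util.py for the real interface.)

```python
import util

# Hypothetical loader name; util.py's actual function may be named differently.
users = util.load_users("train")

# Each entry is assumed to be (handle, metadata, tweets_json_bytestring).
handle, metadata, tweets_json = users[0]

# Assumed signature: get_tweets_element(tweets_bytestring, field_name).
texts = util.get_tweets_element(tweets_json, "text")
print(texts)
```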
For this homework, we’d like you to write the evaluation metric in the evaluation script. We’ve provided starter code in eval.py. Your task is to implement the Average Precision metric described above.
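As a reference, here is a minimal sketch of average precision over a ranked list of 0/1 relevance labels; it is independent of the eval.py starter code, whose interface may differ:

```python
def average_precision(relevance):
    """AP for a ranked list; relevance[i] is 1 if the person at rank i+1
    belongs to conditionPOS, and 0 otherwise."""
    num_relevant = sum(relevance)
    if num_relevant == 0:
        return 0.0
    ap, hits = 0.0, 0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            ap += hits / rank   # precision at this rank, counted only at relevant ranks
    return ap / num_relevant

# The worked example above: conditionPOS users ranked at positions 1, 3, and 5.
print(round(average_precision([1, 0, 1, 0, 1]), 4))  # 0.7556
```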
In order to validate that your implementation is correct, we provide a sample output file (output/random/dev_ptsd_control.txt) with PTSD and control users from the dev set in random order. Running the following command:
python3 eval.py ptsd control ../output/random/dev_ptsd_control.txt
should get you a score of 0.3758.
Our baseline follows the approach proposed by Coppersmith et al. [2]. We will use a character-level language model like those we implemented in Homework 3. We will build one language model for each condition: PTSD, depression, and control. Depending on the distinction we are trying to model (e.g. conditionPOS vs. conditionNEG), we use two of the three language models to obtain a confidence score for each user.
For the task conditionPOS vs. conditionNEG, the output of the code will rank a given list of users from most confident to have conditionPOS to least confident to have conditionPOS (i.e. most confident to have conditionNEG). All the details for performing these tasks are described step by step below:
Once you have downloaded the data, the first step is to create a training corpus corresponding to each condition. Each corpus should contain all tweets from persons in the training set having the indicated condition.
We have provided a script to enable you to do this:
python3 generate_lm_training_text.py
By the end of this section, you should have three text files, control_text.txt, depression_text.txt, and ptsd_text.txt, located in data/lm-training-text.
As you probably remember from Homework 3, the language model helps us “characterize” a body of training text by recording the probability of a letter following a history. To keep the runtime fast, the baseline LM uses an order of 1. This performs sufficiently well, but you are also welcome to train a higher-order model.
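To make “order 1” concrete, here is a toy illustration of a model that conditions on one character of history, with add-one smoothing. It is a sketch of the idea only, not the smoothed implementation provided in train_lm.py:

```python
from collections import Counter, defaultdict
import math

def train_order1_char_lm(text):
    """Count how often each character follows each single-character history."""
    counts = defaultdict(Counter)
    for prev, cur in zip(text, text[1:]):
        counts[prev][cur] += 1
    return counts

def log_prob(counts, text, vocab_size=256):
    """Add-one smoothed log probability of `text` under the toy model."""
    lp = 0.0
    for prev, cur in zip(text, text[1:]):
        total = sum(counts[prev].values())
        lp += math.log((counts[prev][cur] + 1) / (total + vocab_size))
    return lp

model = train_order1_char_lm("i feel fine today. i feel ok today.")
print(log_prob(model, "i feel ok"))
```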
Since you’ve already built a smoothed character-level language model in the past, we’re providing you with the code in train_lm.py. Your task is to use the language model script train_lm.py to generate language models for each of the three target conditions. You can output the three models to the models folder in the provided code skeleton, or to another location where you can access them later.
For instance, to build the model for depression (depression_model), you can use the following command:
python3 train_lm.py ../data/lm-training-text/depression_text.txt ../models/depression_model
Given language models for conditionPOS and conditionNEG, along with a set of a person’s tweets, our goal is now to score that person based on how much more their tweet text aligns with the LM for conditionPOS as opposed to the LM for conditionNEG. To do this, we will use a method proposed by Coppersmith et al. that compares the log probability of the tweet text under the two language models. In what follows, \(C\) is the list of characters in a tweet, and \(\text{log}p(c_{X})\) gives the log probability of the tweet under language model \(X\).
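One way to write the resulting score (the per-character normalization shown here is an assumption; the essential idea is the difference between the two log probabilities) is:

\[\text{score}(C) = \frac{\text{log}p(c_{POS}) - \text{log}p(c_{NEG})}{|C|}\]

A positive score indicates that the tweet looks more like conditionPOS text than conditionNEG text.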
Your task is to complete the code skeleton for the function score_subjects provided in predict_lm.py. Once you’re finished, you will have a script that you can call in the following manner to produce a ranked list of users in the selected dataset, given language model files produced by train_lm.py:
python3 predict_lm.py <SPLIT> <CONDITION_POS_MODEL> <CONDITION_NEG_MODEL> <OUTFILE>
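For example, to rank the dev set for the ptsd vs. control task (the split name and paths here are assumptions based on the layout described above; adjust them to match your setup):

python3 predict_lm.py dev ../models/ptsd_model ../models/control_model ../output/dev_ptsd_control.txt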
Use this script to create outputs for the dev and test data for the following experiments (conditionPOS vs. conditionNEG):

ptsd vs. control
depression vs. control
ptsd vs. depression

Name the output files generated after running the above three experiments on the dev data: dev_ptsd_control.txt, dev_depression_control.txt, dev_ptsd_depression.txt. Similarly, for the test data, the expected output files are: test_ptsd_control.txt, test_depression_control.txt, test_ptsd_depression.txt.
Hint: To speed up the runtime, for each user you can compute the score on every 10th tweet only, and take the median of those scores as the user’s final score.
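A minimal sketch of that per-user scoring (the names are placeholders, and the language models are passed in as generic log-probability functions, so nothing here assumes the actual interface of train_lm.py or the score_subjects skeleton):

```python
import statistics

def score_user(tweet_texts, logprob_pos, logprob_neg):
    """Median, over every 10th tweet, of the length-normalized difference in
    log probability between the conditionPOS and conditionNEG language models."""
    scores = []
    for text in tweet_texts[::10]:          # every 10th tweet, per the hint above
        if not text:
            continue
        diff = logprob_pos(text) - logprob_neg(text)
        scores.append(diff / len(text))     # normalize by number of characters
    return statistics.median(scores) if scores else 0.0

# Users are then ranked from highest score (most confidently conditionPOS)
# to lowest score (most confidently conditionNEG).
```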
Now that you have met the baseline scores, it’s time to build a better model. Use the skeleton provided in predict_ext.py to implement your extended model.
One way to improve on the existing language model implementation is to train a higher-order model, as mentioned above; you are free to explore any other extension you like.
Use your extended model script to create outputs for the dev and test data for the same experiments (conditionPOS vs. conditionNEG):

ptsd vs. control
depression vs. control
ptsd vs. depression

Name the output files generated after running the above three experiments on the dev data: dev_ptsd_control_ext.txt, dev_depression_control_ext.txt, dev_ptsd_depression_ext.txt. Similarly, for the test data, the expected output files are test_ptsd_control_ext.txt, test_depression_control_ext.txt, test_ptsd_depression_ext.txt.
Please explain in the writeup why you chose to implement the extension you did, and what quantitative result you obtained with the extension.
Here are the deliverables that you will need to submit:

writeup.pdf, submitted to HW9: Report Submission on Gradescope. It must contain your explanation of the extension you implemented and the quantitative results you obtained with it.

submission.zip, submitted to Gradescope, containing a README that explains how to run your code. The code should be written in Python 3. You must include the outputs on your dev and test data for all the models described above; the submission directory should contain at least those output files along with your code, and you can add more files if required.
Proceedings of the Fourth Workshop on Computational Linguistics and Clinical Psychology (CLPsych): From Linguistic Signal to Clinical Reality. CLPsych 2015.
CLPsych 2015 Shared Task: Depression and PTSD on Twitter. Glen Coppersmith, Mark Dredze, Craig Harman, Kristy Hollingshead, Margaret Mitchell. CLPsych 2015.
From ADHD to SAD: Analyzing the Language of Mental Health on Twitter through Self-Reported Diagnoses. Glen Coppersmith, Mark Dredze, Craig Harman, Kristy Hollingshead. CLPsych 2015.
Quantifying Mental Health Signals in Twitter. Glen Coppersmith, Mark Dredze, Craig Harman. CLPsych 2014.
Measuring Post Traumatic Stress Disorder in Twitter. Glen Coppersmith, Craig Harman, Mark Dredze. ICWSM 2014.