The costs of mental and neurological health problems in the United States are staggering, estimated at over $760 billion per year. One in five people in the U.S. experience a mental health problem in a given year, and mental health and substance abuse disorders are some of the leading causes of disability worldwide.
Language plays an important role in diagnosing many mental and neurological health problems. This is where we come in. Our aim is to find out whether we can reliably identify people with mental health issues (clinical depression and Post-Traumatic Stress Disorder (PTSD)) based on their Twitter activity. If so, this technology could enable inexpensive screening measures for healthcare professionals.
Here are the materials that you should download for this assignment:
Important Note about Data
This homework assignment involves sensitive mental health data. Therefore, in order to obtain the dataset, you will need to complete CITI Human Subjects Training (~30 mins) and sign a release form. See data/README.md for details.
Our goal is this: Given a set of a person's tweets (and their associated metadata), predict whether that person suffers from depression, PTSD, or neither of these (control). Specifically, we will create three models – one for each pairwise distinction: depression vs. control, ptsd vs. control, and depression vs. ptsd.

Each model takes as input a pair of conditions to distinguish between (we'll refer to them as conditionPOS and conditionNEG), and a list of people, each actually having either conditionPOS or conditionNEG. The model should re-rank the list of people such that those most likely to have conditionPOS are at the top, and those most likely to have conditionNEG are at the bottom.
The evaluation metric we will use is called average precision:

$$\mathrm{AP} = \frac{1}{|R|} \sum_{k=1}^{n} P(k) \cdot \mathrm{rel}(k)$$

where $R$ is the set of relevant people (i.e. people belonging to conditionPOS), $n$ is the total number of people being ranked, $P(k)$ gives the precision at the $k$-th person, and $\mathrm{rel}(k)$ is an indicator variable denoting whether the person ranked at position $k$ belongs to conditionPOS. For example, imagine there are 5 ranked people with conditions as follows:

conditionPOS, conditionNEG, conditionPOS, conditionNEG, conditionPOS

conditionPOS people have been ranked at positions 1, 3, and 5. The average precision, then, would be:

$$\mathrm{AP} = \frac{1}{3}\left(\frac{1}{1} + \frac{2}{3} + \frac{3}{5}\right) \approx 0.756$$
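The computation above can be sketched in a few lines of Python (the function name is illustrative; your eval.py implementation may organize this differently):

```python
from typing import List

def average_precision(relevant: List[bool]) -> float:
    """Compute average precision over a ranked list.
    relevant[i] is True if the person at rank i+1 belongs to conditionPOS."""
    hits = 0
    total = 0.0
    for rank, rel in enumerate(relevant, start=1):
        if rel:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / hits if hits else 0.0

# The 5-person example above: POS, NEG, POS, NEG, POS
print(round(average_precision([True, False, True, False, True]), 4))  # → 0.7556
```

Note that only the ranks of conditionPOS people contribute terms to the sum; the denominator is the total number of conditionPOS people, not the list length.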
The dataset consists of a meta file and train and test directories. The meta file is a .csv with information about each user, including an (anonymized) Twitter handle, age, gender, tweet count, and experimental condition (depression, ptsd, or control). Each user also has an assigned chunk index, which refers to the location of that user's data in the train or test directory.
The tweets themselves are contained in the train and test directories. Each directory contains numbered chunks of tweets, with each chunk having tweets from a list of users. Within the test directory, we have partitioned the test data into a dev and test set, with the dev set consisting of chunks 60-65 and test data in chunks 66-89.
We have provided a simple interface for you to access the data in the util module.

The above function returns a list with a tuple for each user, consisting of the user's anonymized handle, metadata, and a bytestring of tweets in JSON format. Here's an example for a user with just a single tweet:
You can access a particular element of tweets for a given user using the helper function
util.get_tweets_element. For example, to extract the text of each tweet from the same user, run:
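If you want to see what such an extraction looks like without the helper, it can also be done directly with the standard json module. This is a minimal sketch, assuming the bytestring holds one JSON object per line with a "text" field (the raw value shown here is hypothetical; adjust the parsing if the dump is a single JSON array):

```python
import json

# Hypothetical raw value, standing in for the bytestring of tweets
# returned for a user by the data-loading interface.
raw = b'{"id": 1, "text": "hello world"}\n{"id": 2, "text": "second tweet"}'

# Parse each line as a JSON object and pull out its "text" field.
texts = [json.loads(line)["text"] for line in raw.splitlines() if line.strip()]
print(texts)  # → ['hello world', 'second tweet']
```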
For this homework, we’d like you to write your own evaluation script. We’ve provided starter code in
eval.py. Your task is to implement the Average Precision metric described above.
In order to validate that your implementation is correct, we provide a sample output file (dev_ptsd_control.txt), which ranks the ptsd and control users from the dev set in random order. Running the following command:
python3 eval.py ptsd control ../output/random/dev_ptsd_control.txt
should get you a score of 0.3758.
Our baseline follows the approach proposed by Coppersmith et al. We will utilize a character-level language model like those we implemented in Homework 5. We will build one language model for each condition: depression, ptsd, and control. Depending on the distinction we are trying to model (e.g. conditionPOS vs. conditionNEG), we use two of the three language models to obtain a confidence score for each user. The output of the code will rank a given list of users from most confident to have conditionPOS to least confident to have conditionPOS (i.e. most confident to have conditionNEG).
Once you have downloaded the data, the first step is to create a training corpus corresponding to each condition. Each corpus should contain all tweets from persons in the training set having the indicated condition.
We have provided a script to enable you to do this:
By the end of this section, you should have three text files, one per condition (e.g. ptsd_text.txt).
As you probably remember from Homework 5, the language model helps us “characterize” a body of training text by recording the probability of a letter following a history. For the purposes of achieving a fast runtime, the baseline LM uses an order of 1. This performs sufficiently well, but you are also welcome to train a higher-order model.
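To make the order-1 setup concrete, here is a minimal sketch of a character bigram (order-1) model with add-k smoothing. This is illustrative only, not the provided train_lm.py: the class name, the add-k smoothing scheme, and the "^" start-of-text marker are all assumptions.

```python
import math
from collections import Counter, defaultdict

class CharBigramLM:
    """Order-1 (character bigram) language model with add-k smoothing."""

    def __init__(self, k: float = 0.1):
        self.k = k
        self.bigrams = defaultdict(Counter)  # previous char -> counts of next char
        self.vocab = set()

    def train(self, text: str) -> None:
        padded = "^" + text  # "^" marks the start of the text
        for prev, cur in zip(padded, padded[1:]):
            self.bigrams[prev][cur] += 1
            self.vocab.add(cur)

    def log_prob(self, text: str) -> float:
        """Sum of smoothed log P(cur | prev) over the characters of text."""
        padded = "^" + text
        vocab_size = len(self.vocab)
        total = 0.0
        for prev, cur in zip(padded, padded[1:]):
            counts = self.bigrams[prev]
            total += math.log((counts[cur] + self.k)
                              / (sum(counts.values()) + self.k * vocab_size))
        return total
```

The smoothing ensures that characters and histories unseen in training still receive nonzero probability, so scoring arbitrary tweets never produces a log of zero.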
Since you've already built a smoothed character-level language model in the past, we're providing you with the code in train_lm.py. Your task is to use this script to generate language models for each of the three target conditions. Write the three models to a location where you can access them later.
Given language models for conditionPOS and conditionNEG, along with a set of a person's tweets, our goal is now to score that person based on how much more their tweet text aligns with the LM for conditionPOS as opposed to the LM for conditionNEG. To do this we will use a method proposed by Coppersmith et al.:

$$\mathrm{score}(t) = \mathrm{LL}_{\mathrm{POS}}(t) - \mathrm{LL}_{\mathrm{NEG}}(t)$$

where $t$ is the list of characters in a tweet, and $\mathrm{LL}_M(t)$ gives the log probability of the tweet under language model $M$.
NOTE: to speed up the runtime, our baseline implementation calculates each user’s score based on every 10th tweet and takes the median.
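Putting the per-tweet score and the subsampling/median step together, a user-level scorer can be sketched as follows. This is an assumption-laden sketch, not the provided predict_lm.py: the function name and the LM's log_prob method are illustrative stand-ins for whatever interface your language model objects expose.

```python
import statistics

def score_user(tweets, lm_pos, lm_neg):
    """Score one user: log-probability difference between the conditionPOS
    and conditionNEG language models on every 10th tweet, aggregated with
    the median (the baseline's runtime speed-up)."""
    scores = []
    for tweet in tweets[::10]:  # every 10th tweet only
        if not tweet:
            continue
        scores.append(lm_pos.log_prob(tweet) - lm_neg.log_prob(tweet))
    # Users with no scored tweets get a neutral score of 0.
    return statistics.median(scores) if scores else 0.0
```

Sorting users by this score in descending order then yields the ranked list that eval.py consumes.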
Your task is to complete the code skeleton for the function
score_subjects provided in
predict_lm.py. Once you’re finished, you will have a script that you can call in the following manner to produce a ranked list of users in the selected dataset, given language model files produced by
python3 predict_lm.py <SPLIT> <CONDITION_POS_MODEL> <CONDITION_NEG_MODEL> <OUTFILE>
Use this script to create outputs for the dev and test sets for the following experiments:
Without extensions, the baseline average precision you should be able to achieve with the dev data is:
Now that you have met the baseline scores, it’s time to build a better model. Use the skeleton provided in
predict_ext.py to implement your extended model.
Some ideas for how you might improve on the existing language model implementation include:
Please explain in the writeup why you chose to implement the extension you did, and what quantitative result you obtained with the extension.
Here are the deliverables that you will need to submit:
depression. These should be produced using your best extended model.
A README.md file that explains how to run your extended model
|Proceedings of the Fourth Workshop on Computational Linguistics and Clinical Psychology (CLPsych): From Linguistic Signal to Clinical Reality. CLPsych 2015.|
|CLPsych 2015 Shared Task: Depression and PTSD on Twitter Glen Coppersmith, Mark Dredze, Craig Harman, Kristy Hollingshead, Margaret Mitchell. CLPsych 2015.|
|From ADHD to SAD: Analyzing the Language of Mental Health on Twitter through Self-Reported Diagnoses Glen Coppersmith, Mark Dredze, Craig Harman, Kristy Hollingshead. CLPsych 2015.|
|Quantifying Mental Health Signals in Twitter Glen Coppersmith, Mark Dredze, Craig Harman. CLPsych 2014.|
|Measuring Post Traumatic Stress Disorder in Twitter Glen Coppersmith, Craig Harman, Mark Dredze. ICWSM 2014.|