At Smith, I double majored in Computer Science and Statistical & Data Sciences. This summer, I am working as a research assistant with Professor Ani Nenkova in Natural Language Processing. Specifically, my project is about annotating medical literature. You can read more about Professor Nenkova's work here.
You can follow along in my journey below!
You can read my final report here.
I am happy to find out that Professor Nenkova has two DREU interns this year so I won't be alone! Liz Conrad is the other intern and you can follow her journey here.
My first week at the internship has mostly involved sorting out logistics: getting set up with our Penn IDs and getting access to the computers in the lab and to the grid that the NLP group at Penn uses. We also had a meeting with Prof. Nenkova and her graduate student (Oshin Agarwal) to discuss the big picture of the project we will be working on this summer. They introduced us to the dataset we will be working with, which is the publicly available data from www.clinicaltrials.gov.
Outside of work, I spent most of this week figuring out my off-campus housing situation such as locating grocery stores and getting used to cooking for myself every night. My workplace is around a 0.7 mile walk from my apartment so I figured that would be good exercise for the summer!
Professor Nenkova assigned us several papers to read so that we are up to speed on the project. Two of those papers are about the project. They are here and here. She also assigned us some other papers about related work that she thought we'd find interesting. My project this summer will be building on the second paper.
We downloaded the data and started digging through it this week. My task was to look at clinical trials that had a published study attached to them. Once I retrieved those trials, I computed how much overlap existed between the clinical trials and the associated abstracts for studies on PUBMED. If there was overlap, that trial was viable data for my summer project!
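To make the idea of "overlap" concrete, here is a minimal sketch of one way such a check could be done. It is only an illustration: it assumes overlap is measured as the fraction of a trial's distinct word tokens that also appear in the abstract, and the function names and example strings are hypothetical rather than taken from the actual project.

```python
import re

def tokens(text):
    """Lowercase a string and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def overlap(trial_text, abstract_text):
    """Fraction of the trial's distinct tokens that also appear in the abstract."""
    trial, abstract = tokens(trial_text), tokens(abstract_text)
    if not trial:
        return 0.0  # an empty trial field cannot overlap with anything
    return len(trial & abstract) / len(trial)

# Hypothetical example strings, not real trial or abstract text.
trial_field = "Metformin in adults with type 2 diabetes"
abstract = "We studied metformin in adults who had type 2 diabetes."
score = overlap(trial_field, abstract)
print(round(score, 2))
```

A trial whose score clears some chosen threshold would count as having usable overlap; where to set that threshold is a judgment call that depends on the data.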
Outside of work, I started exploring some of the offerings of West Philadelphia. I discovered this amazing bakery called Manakeesh that does $2.50 smoothies on Tuesdays!
This week I presented what I had so far to Professor Nenkova and we agreed that this was a viable direction to move in. We have outlined the direction of the project as follows: We will improve the classifier that annotates medical literature into PICO (Participant, Intervention, Condition, Outcome) elements. We will be building on previous work by using new data made available from Clinical Trials. The data from Clinical Trials will supplement the manually annotated data we have as the training set. However, Clinical Trials data does not automatically lend itself to the same format as the manual annotations, so we will first work on this aspect.
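As a rough illustration of the format problem, the sketch below turns a free-text phrase (for example, an intervention name from a trial record) into token-level labels on an abstract. It assumes a BIO-style tagging scheme and exact, case-insensitive phrase matching. Both are assumptions made for this example; the function name, the `INT` label, and the sample sentence are hypothetical.

```python
def label_tokens(abstract_tokens, phrase_tokens, label):
    """Tag each occurrence of phrase_tokens with B-/I- labels; all other tokens get 'O'."""
    tags = ["O"] * len(abstract_tokens)
    n = len(phrase_tokens)
    if n == 0:
        return tags
    lowered = [t.lower() for t in abstract_tokens]
    phrase = [t.lower() for t in phrase_tokens]
    for i in range(len(lowered) - n + 1):
        if lowered[i:i + n] == phrase:
            tags[i] = "B-" + label          # first token of the match
            for j in range(i + 1, i + n):
                tags[j] = "I-" + label      # remaining tokens of the match
    return tags

# Hypothetical sentence and phrase, for illustration only.
sentence = "Patients received oral metformin daily".split()
tags = label_tokens(sentence, ["oral", "metformin"], "INT")
print(list(zip(sentence, tags)))
```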
Liz and I have been meeting with Professor Nenkova twice a week. She also introduced us to the third student she will be working with this summer, though he is working on a separate project.
This week was pretty exciting because it was Pride in Philadelphia. I also visited the Barnes Foundation and the Fleisher Art Museum!
I worked on getting the data in the correct format this week to train the classifier. I familiarized myself with TensorFlow and did some literature review to get a better idea of how an LSTM-CRF works, as this is the neural network we will be using for the classifier. I followed the introductory tutorial here to run a simple LSTM-CRF.
The LSTM-CRF takes about 3-4 days to run on the grid so by this Tuesday, I finally had results. The results are mixed. The Clinical Trials data performs much better than EBM-NLP (the manually annotated corpus) for Interventions but performs poorly by comparison for Outcomes. I reported these results to the group and we decided that we will have to refine the data some more to see if the results improve.
My friend visited me for the week so we did a lot of sightseeing in Philly. We loved going to Spruce Street Harbor Park and getting pie from Magpie! We also registered for a weekend-long packrafting/camping trip with the UPenn Outdoor Adventures group for a late weekend in July so I am looking forward to that.
Professor Nenkova and I had a chat about graduate schools which was very helpful. She told me about various NLP doctorate programs along the East Coast (which is where I want to be) and their respective strengths. We also talked about a timeline for application requirements.
I tried a few different things with the LSTM-CRF to see if our results are any different. There was marginal improvement. I also ran the classifier for conditions (a subset of Participants). I spent most of this week drawing up tables to compare my results with previous work to see if this approach is viable. I also ran the classifiers with the combined dataset (both EBM-NLP and Clinical Trials) hoping that the strengths of the two datasets will manifest in this combination.
Liz and I hung out properly this weekend and it was a lot of fun!
We have decided that the work I have done so far with the Clinical Trials data is not useful in improving the classifier. Our end goal is to have all the medical literature on PUBMED be annotated and our current classifier is not precise enough to be deployed at this large scale. This is a little concerning because we had high hopes for this dataset.
Instead, we are now looking to narrow down on conditions and see if we can use a look-up approach to detect mentioned conditions in an abstract. For this, we are using existing disease ontologies like SNOMED and ICD-10 as well as a compiled list of conditions from Clinical Trials and the manually annotated EBM-NLP corpus. The reason for focusing on conditions only is so that we can at least categorize all PUBMED literature by the condition it is studying.
I spent the weekend in NYC and I was amazed how much cheaper everything in Philly is compared to NYC! I love food and Philly is definitely a food lover's paradise.
I spent most of this week figuring out how to use SNOMED as their API is quite restricted. Once we got a license to download the database, we had to set it up in a local database so we could query it. I made progress with using ICD-10 for looking up conditions but the results were dismal. We believe this is because the language in ICD-10 is very different from the language in medical literature. For example, ICD-10 fails to capture "autism" as a condition in the abstract because autism is coded as "Autistic Spectrum Disorder" in the ontology.
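The autism example can be reproduced with a tiny sketch of a look-up approach. This assumes the look-up is a case-insensitive, whole-phrase match of ontology names against the abstract text. The two-entry "ontology" and the sample abstract are made up for illustration; the real ICD-10 list is far larger.

```python
import re

def find_conditions(abstract, condition_names):
    """Return the condition names that occur verbatim (case-insensitive) in the abstract."""
    text = abstract.lower()
    found = []
    for name in condition_names:
        # Match the full name as a whole phrase, not as part of a longer word.
        pattern = r"\b" + re.escape(name.lower()) + r"\b"
        if re.search(pattern, text):
            found.append(name)
    return found

# Hypothetical mini "ontology" and abstract sentence.
ontology_terms = ["Autistic Spectrum Disorder", "Asthma"]
abstract = "We enrolled children with autism and mild asthma."
matches = find_conditions(abstract, ontology_terms)
print(matches)
```

Here "Asthma" is found but the autism mention is missed, because the abstract never uses the exact phrase stored in the ontology. Closing that gap would need synonym lists or some form of fuzzy matching.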
I spent some of my free time narrowing down a list of graduate schools and appropriate programs, keeping in mind the tips that Prof. Nenkova had provided. I also visited the Penn Museum!
I made progress towards using SNOMED to look up conditions. SNOMED has 7 different semantic tags and, to look up conditions, we used the disorder tags. We found that SNOMED gives us higher precision and recall than ICD-10 codes. However, SNOMED over-predicts, and we hypothesise that this may be because it picks out outcomes and/or interventions as well as conditions. This is because we are doing a blind search without learning the context that is so crucial for an LSTM-CRF.
For next week, we will be looking at this hypothesis in greater detail and will try to quantify why simple look-up measures will not work for the task of annotating conditions in medical literature.
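For readers unfamiliar with the metrics, here is a small sketch of how precision and recall could be computed for a look-up, and how over-prediction shows up in the numbers. It assumes each abstract is scored by comparing the set of predicted conditions against the set of gold (manually annotated) conditions. The example sets are hypothetical and are not results from the project.

```python
def precision_recall(predicted, gold):
    """Set-based precision and recall of predicted items against gold items."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical abstract: the only true condition is asthma, but the look-up
# also flags "wheezing", which is reported as an outcome in this abstract.
gold_conditions = {"asthma"}
lookup_hits = {"asthma", "wheezing"}
p, r = precision_recall(lookup_hits, gold_conditions)
print(p, r)
```

In this toy case recall stays perfect while precision drops to one half, which is the pattern one would expect if a context-blind search picks up outcome or intervention terms alongside real conditions.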
Unfortunately, the camping/packrafting trip to the Poconos was cancelled because of inclement weather so that was a bummer. I stayed in for the weekend and worked some more on my project.
I spent this week quantifying why learning from context is important for annotating medical literature. Because I used many different types of lookup tables, I compiled tables that compare all our results side by side. I also spent much of the week wrapping up the loose ends of the few different things I did this summer.
As my weeks here are drawing to a close, I realize that I will really miss Philadelphia. I had an amazing summer outside of work too - I learnt how to cook traditional Pakistani food and really came to appreciate living independently after 4 years of college.
It's sad to think that this is my last week here. I spent most of it wrapping up my project and on Friday, during the call with Professor Nenkova's collaborators, I presented my findings to them. We agreed that the work is still in its early stages and that I will continue working on it after the summer. I had a wonderful time working with Professor Nenkova and her group this summer and I am happy that I will be able to remain involved even after DREU ends. This was my first research opportunity at another school, and in pure NLP, so not only did I learn a lot about the field, I also got a taste of graduate-level research.
Saying goodbye to Liz was hard because we had gotten close during the last few weeks. Even though she'll be in Cincinnati and I will be in New York, I hope we can stay in touch and visit each other.