For the first part of this week, Graciela was away at a conference, so I met only with one of her PhD students, Bob Leaman. My project hasn't been decided yet, and won't be until Graciela gets back, so he started me off learning some things that should be useful no matter what I end up doing here.
All of the projects in Graciela's lab (the DIEGO lab, as it's called) are related to biomedical information extraction. This usually involves some form of machine learning, so Bob explained that thoroughly. He also showed me LinkGrammar and Weka. Mostly, though, I ended up reading lots of papers this week; they are listed at the end of this post.
Once I was able to meet with Graciela, I chose to work on a project doing named entity recognition (NER) with a deep parser. Named entity recognition is where you identify certain entities, like gene or disease names, throughout a text. Since new names are coined as new genes and diseases are discovered, and since everyone has their own way of referring to these entities, NER needs to do more than just check a dictionary. This is where machine learning and feature generation come in. One way to generate features is to use a shallow parser, which provides things like part-of-speech tags and chunking. What I'll be doing is using a deep parser to see if this improves accuracy. Deep parsing involves a more detailed syntactic analysis of sentences, so when the system is estimating the likelihood of a word being a named entity, it can utilise information like "What are the adjectives modifying this word? What is the direct object of the action this word performs?". Hopefully, these extra features will be useful for training the system.
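To make the difference between shallow and deep features a bit more concrete, here is a minimal sketch in Python. It is only an illustration, not what the lab's system actually does: the example sentence, its hand-written tags, the dependency-style `head`/`relation` encoding, and all of the function and feature names are my own assumptions (LinkGrammar, for instance, produces typed links rather than this exact representation).

```python
from typing import Dict, List, NamedTuple, Optional


class Token(NamedTuple):
    text: str
    pos: str             # part-of-speech tag (shallow)
    chunk: str           # chunk tag in BIO notation (shallow)
    head: Optional[int]  # index of the syntactic head, None for the root (deep)
    relation: str        # grammatical relation to the head (deep)


def shallow_features(tokens: List[Token], i: int) -> Dict[str, str]:
    """Features available from a shallow parse: the word, its POS tag, its chunk."""
    tok = tokens[i]
    return {
        "word": tok.text.lower(),
        "pos": tok.pos,
        "chunk": tok.chunk,
        "has_digit": str(any(c.isdigit() for c in tok.text)),
    }


def deep_features(tokens: List[Token], i: int) -> Dict[str, str]:
    """Extra features that need a full syntactic analysis of the sentence."""
    tok = tokens[i]
    feats = {"relation": tok.relation}

    # "What are the adjectives modifying this word?"
    adjectives = [t.text.lower() for t in tokens
                  if t.head == i and t.relation == "amod"]
    if adjectives:
        feats["adjective_modifiers"] = "|".join(adjectives)

    # "What is the direct object of the action this word performs?"
    if tok.relation == "nsubj" and tok.head is not None:
        feats["governing_verb"] = tokens[tok.head].text.lower()
        objects = [t.text.lower() for t in tokens
                   if t.head == tok.head and t.relation == "dobj"]
        if objects:
            feats["verb_direct_object"] = "|".join(objects)
    return feats


def token_features(tokens: List[Token], i: int) -> Dict[str, str]:
    """Combine both feature sets, as a classifier's input for token i."""
    return {**shallow_features(tokens, i), **deep_features(tokens, i)}


# Made-up, hand-annotated example sentence (all tags are assumptions).
SENTENCE = [
    Token("The", "DT", "B-NP", 3, "det"),
    Token("mutant", "JJ", "I-NP", 3, "amod"),
    Token("BRCA1", "NN", "I-NP", 3, "nn"),
    Token("gene", "NN", "I-NP", 4, "nsubj"),
    Token("causes", "VBZ", "B-VP", None, "root"),
    Token("cancer", "NN", "B-NP", 4, "dobj"),
]

if __name__ == "__main__":
    for index, token in enumerate(SENTENCE):
        print(token.text, token_features(SENTENCE, index))
```

For the word "gene", the shallow features only say that it is a noun inside a noun phrase, while the deep features add that it is modified by "mutant" and is the subject of "causes", whose direct object is "cancer". In a real system, feature dictionaries like these would be handed to a machine learning model for training.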
Next up, research proposal!
Papers I read this week:
Frontiers of biomedical text mining: current progress, by P. Zweigenbaum, D. Demner-Fushman, H. Yu, et al.
What makes a gene name? Named entity recognition in the biomedical literature, by U. Leser, J. Hakenberg.
BANNER: an executable survey of advances in biomedical named entity recognition, by R. Leaman, G. Gonzalez.
Tackling the BioCreative2 Gene Mention task with Conditional Random Fields and Syntactic Parsing, by A. Vlachos.
Lexical adaptation of link grammar to the biomedical sublanguage: a comparative evaluation of three approaches, by S. Pyysalo, T. Salakoski, S. Aubin, et al.
BioInfer: a corpus for information extraction in the biomedical domain, by S. Pyysalo, F. Ginter, J. Heimonen.