
Week 2

This week I wrote my research proposal.

Reason:
As the number of publications in the biomedical domain grows, so does the need for an automated way of managing and analysing them. Machine learning and natural language processing techniques make it possible to process the text automatically for patterns.

One such technique, named entity recognition (NER), automatically identifies named entities such as gene names or diseases. NER is the foundation on which other text mining tools are built. These tools can only be as strong as their foundation, so NER is of the utmost importance.

The small amount of research previously done on using full parsing to aid named entity recognition indicates that it is a promising new direction for NER systems. L. Smith and W. Wilbur evaluated nine different parsers for their ability to improve recognition of gene names and found that all of them increased the break-even value, while producing similar features. Still untested are the LinkGrammar parser and the recognition of disease names and other biomedical entities in addition to gene names.

Methods:
LinkGrammar is a syntactic parser that provides both a constituent tree and a linkage diagram of a sentence. The diagram represents linkages such as adjective-noun, verb-object, and preposition-object. S. Pyysalo, et al., adapted LinkGrammar for use in the biomedical domain and found a 10% decrease in error. Whether this specially modified parser will show similar gains in named entity recognition remains to be seen.

BANNER is a named-entity recognition system for the biological domain, currently achieving an F-measure of 86.43. It is limited by its inability to process a sentence as a whole, as a series of linkages. Some of its errors could have been corrected had it had access to a full syntactic parse. For example, BANNER did not correctly label "M-MuLVneo delta Enh" in the following sentence:

"However, a few sites in the genomes of EC cells permit M-MuLVneo delta Enh proviral expression."
Parsing this sentence using LinkGrammar makes it clear that all the tokens "M-MuLVneo delta Enh proviral" are linked to "expression" as adjectives. Because "expression" is a noun commonly used with gene names, this would indicate that the series of adjectives linked to it should be tagged.
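
The idea above could be turned into a per-token feature roughly as follows. This is only a minimal sketch: the links are written by hand to mirror the description (they are not real LinkGrammar output), the link labels "A", "AN" and "O" are assumed to mark adjective-noun, noun-modifier and verb-object links, and the names GENE_HEAD_NOUNS and modifier_feature are hypothetical, not part of BANNER or LinkGrammar.

```python
from typing import List, Tuple

# A link is (left_token_index, right_token_index, label).
# Hand-written for illustration; NOT actual parser output.
Link = Tuple[int, int, str]

# Hypothetical list of head nouns that commonly follow gene names.
GENE_HEAD_NOUNS = {"expression"}
# Assumed labels for links that attach a modifier to a noun.
MODIFIER_LABELS = {"A", "AN"}


def modifier_feature(tokens: List[str], links: List[Link]) -> List[bool]:
    """Flag each token that modifies a head noun associated with genes."""
    flagged = set()
    for left, right, label in links:
        if label in MODIFIER_LABELS and tokens[right].lower() in GENE_HEAD_NOUNS:
            flagged.add(left)
    return [i in flagged for i in range(len(tokens))]


# Tail of the example sentence, with assumed links to "expression".
tokens = ["permit", "M-MuLVneo", "delta", "Enh", "proviral", "expression"]
links = [(0, 5, "O"), (1, 5, "A"), (2, 5, "A"), (3, 5, "A"), (4, 5, "A")]

features = modifier_feature(tokens, links)
print(features)  # [False, True, True, True, True, False]
```

Note that "proviral" is flagged as well, even though it is not part of the gene name, so a feature like this would be one signal among many rather than a tagging rule on its own.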

Modifying BANNER to obtain full syntactic data for each sentence from LinkGrammar might correct some of these errors.

Objectives:
a. A modified version of BANNER that takes advantage of LinkGrammar's deep parsing.
b. A set of features, drawn from the new syntactic information, that are found to be useful in improving BANNER's F-score.
c. A paper evaluating the performance of the modifications.
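
Objectives b and c both depend on measuring the F-score. As a rough sketch of how that could be computed, the function below compares predicted entity spans against gold-standard spans, assuming exact-match scoring and the standard F-measure formula. The span format (sentence id, start token, end token), the example data and the name f_measure are all hypothetical, and this is not BANNER's own evaluation code.

```python
from typing import Iterable, Tuple

# A span is (sentence_id, start_token, end_token); format assumed for illustration.
Span = Tuple[int, int, int]


def f_measure(gold: Iterable[Span], predicted: Iterable[Span],
              beta: float = 1.0) -> Tuple[float, float, float]:
    """Return (precision, recall, F) using exact span matching."""
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    if true_positives == 0:
        return 0.0, 0.0, 0.0
    precision = true_positives / len(pred_set)
    recall = true_positives / len(gold_set)
    b2 = beta * beta
    f = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f


# Made-up spans: 4 gold entities, 3 predictions, 2 of them correct.
gold = [(0, 11, 13), (1, 2, 2), (1, 7, 9), (2, 0, 1)]
predicted = [(0, 11, 13), (1, 2, 2), (2, 0, 3)]

precision, recall, f1 = f_measure(gold, predicted)
print(f"{precision:.4f} {recall:.4f} {f1:.4f}")  # 0.6667 0.5000 0.5714
```

Exact matching is the strictest choice; a prediction that misses an entity boundary by one token counts as both a false positive and a false negative.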