When searching for information from medical journals using PubMed, users (including healthcare professionals, researchers, and patients and their families) are returned so many articles that they have trouble finding the ones they need: they experience information overload.
Why does PubMed return so many results? It was developed by the National Center for Biotechnology Information (NCBI) at the National Library of Medicine (NLM) to provide web access to the gigantic MEDLINE database [1]. MEDLINE contains more than 10 million bibliographic citations and abstracts from over 4300 biomedical journals published in the U.S. and in 70 foreign countries [2]. Coverage extends back to 1966 and approximately 31,000 new citations are added monthly [3].
For her Ph.D. dissertation, my mentor Wanda Pratt created an approach to decreasing information overload that relies on a query model and a terminology model to categorize and organize results. Using this approach, she designed, implemented, and tested a tool called Dynamic Categorization (DynaCat), which categorizes the journal articles returned by PubMed.
Using the Unified Medical Language System (UMLS), maintained by the NLM, together with knowledge about the types of queries, DynaCat returns categorized search results, which are more helpful than the ranked results of typical search engines or document cluster results [4].
"The purpose of the UMLS is to aid the development of systems that help health professionals and researchers retrieve and integrate electronic biomedical information from a variety of sources and to make it easy for users to link disparate information systems, including computer-based patient records, bibliographic databases, factual databases, and expert systems. The UMLS project develops 'Knowledge Sources' that can be used by a wide variety of applications programs to overcome retrieval problems caused by differences in terminology and the scattering of relevant information across many databases." --UMLS Fact Sheet [5]
Essentially, the UMLS is based on the idea that the wide variety of vocabularies used by different groups and sources in the medical community prevents effective retrieval of information. The UMLS is divided into 3 Knowledge Sources: the Metathesaurus, the SPECIALIST Lexicon, and the Semantic Network. The Metathesaurus contains almost 800,000 biomedical concepts along with their various names, semantic information, and the relations among them [6]. The SPECIALIST Lexicon contains over 140,000 English words and biomedical terms along with their syntactic and morphological information [7]. It was created to enable natural language processing. The Semantic Network is made up of 134 general categories (called semantic types), to which all concepts in the Metathesaurus have been assigned, and 54 relationships that exist between the semantic types [8].
DynaCat uses the UMLS in several ways: first to create the categories appropriate for the results of the user's query, then to place each article returned by PubMed into the appropriate category (or categories). Each citation in MEDLINE is assigned keywords from the Medical Subject Headings (MeSH) vocabulary that represent the content of the article to help users find relevant information [9]. Depending on the type of query (the query model handles 9 query types: treatments, tests, symptoms, problems, prognostic indicators, prognosis, preventative actions, risk factors, and diagnoses), DynaCat uses the MeSH hierarchy to place the search results in meaningful categories. For a single article, DynaCat looks at each of the MeSH terms, checks its semantic type, checks its position in the MeSH hierarchy, and decides where the article falls among the result categories.
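To make the per-article step concrete, here is a minimal Java sketch of that kind of categorization. It is an illustration only, not DynaCat's code: the class and method names, the table mapping query types to semantic types, and the rule of labeling a category with the term's parent in the hierarchy are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: none of these names come from the DynaCat source.
class CategorizerSketch {

    // One MeSH keyword plus the two facts the text says are inspected.
    // A real system would look these up in the UMLS and the MeSH hierarchy;
    // here the caller supplies them directly.
    static final class MeshTerm {
        final String name;
        final String semanticType;
        final List<String> ancestors; // root-first path, excluding the term itself

        MeshTerm(String name, String semanticType, List<String> ancestors) {
            this.name = name;
            this.semanticType = semanticType;
            this.ancestors = ancestors;
        }
    }

    // Assumed mapping from a query type to the semantic types relevant to it.
    static final Map<String, Set<String>> RELEVANT_TYPES = Map.of(
            "treatments", Set.of("Pharmacologic Substance",
                                 "Therapeutic or Preventive Procedure"),
            "symptoms", Set.of("Sign or Symptom"));

    // Returns category label -> ids of the articles placed in that category.
    // An article can land in several categories, one per relevant MeSH term.
    static Map<String, List<String>> categorize(String queryType,
                                                Map<String, List<MeshTerm>> articles) {
        Set<String> relevant = RELEVANT_TYPES.getOrDefault(queryType, Set.of());
        Map<String, List<String>> categories = new LinkedHashMap<>();
        for (Map.Entry<String, List<MeshTerm>> article : articles.entrySet()) {
            for (MeshTerm term : article.getValue()) {
                if (!relevant.contains(term.semanticType)) {
                    continue; // this keyword says nothing about the query type
                }
                // Assumed labeling rule: use the term's parent in the
                // hierarchy, or the term itself when it has no parent.
                String category = term.ancestors.isEmpty()
                        ? term.name
                        : term.ancestors.get(term.ancestors.size() - 1);
                List<String> members =
                        categories.computeIfAbsent(category, k -> new ArrayList<>());
                if (!members.contains(article.getKey())) {
                    members.add(article.getKey());
                }
            }
        }
        return categories;
    }
}
```

Keeping the semantic-type filter separate from the hierarchy lookup mirrors the two checks described above, so either rule could be swapped out without touching the other.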
Note: "we" refers to fellow DMP participant Christine Groce and me, working under the guidance of Cathy Blake, a graduate student in Wanda's research group.
Our goal was to create a local database containing the relational tables of the UMLS, then modify the existing DynaCat code to use that local database instead of the UMLS API. This should improve both reliability and speed. We also hope the local copy of the UMLS database will make it easier for future research to process the text of an abstract or the full text of an article.
We began by familiarizing ourselves with the basics of DynaCat and client/server computing. We researched which relational database management system could handle a database the size of the UMLS (3 GB). Here are statistics on the size of the UMLS before and after the files were put into MySQL tables. We also researched different ways to gain access to a database remotely.
Once we had settled on an approach, we planned the next steps and started on them. I downloaded and installed MySQL while Christine downloaded and installed the Java Database Connectivity (JDBC) driver. Using load scripts from the UMLS info site as a guide, we wrote more load scripts to put the UMLS files into tables in MySQL. Then, using MySQL in batch mode, I loaded 40 tables into MySQL, the largest of which was the Metathesaurus table MRCXT, containing over 13 million records. Setting up MySQL was challenging, a mini-introduction to database administration for me. Learning the logic behind creating effective SQL queries was also new to me. Here are some tips from me about using MySQL.
I studied the organization of the UMLS in order to be able to replace the UMLS API calls with Java methods that use JDBC to access the local database. The methods we worked on are described below:
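As a rough illustration only (this is not the project's actual code), a JDBC method of this kind might look like the sketch below. It assumes a table named MRCON with a concept-identifier column CUI and a string column STR, following the layout of the Metathesaurus relational files of that era; the class name, method name, and query are hypothetical.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of replacing a remote UMLS API call with a local query.
class UmlsLookupSketch {

    // Assumed schema: MRCON(CUI, ..., STR, ...), loaded from the UMLS files.
    static final String FIND_CUIS_SQL =
            "SELECT DISTINCT CUI FROM MRCON WHERE STR = ?";

    // Returns the concept identifiers whose name exactly matches the term.
    // The caller owns the connection and is responsible for closing it.
    static List<String> findConceptIds(Connection conn, String term)
            throws SQLException {
        List<String> cuis = new ArrayList<>();
        // A parameterized statement keeps the user's text out of the SQL.
        try (PreparedStatement stmt = conn.prepareStatement(FIND_CUIS_SQL)) {
            stmt.setString(1, term);
            try (ResultSet rows = stmt.executeQuery()) {
                while (rows.next()) {
                    cuis.add(rows.getString("CUI"));
                }
            }
        }
        return cuis;
    }
}
```

A caller would obtain the connection from `DriverManager.getConnection` with a MySQL URL of the form `jdbc:mysql://host/database`; the host, database name, and credentials depend on the local setup. With a table as large as the ones described here, an index on the looked-up column would likely be needed for this query to be fast.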
The next step is what I had hoped we would finish before leaving UCI: changing the code in the DynaCat servlet to access the UMLS database locally, instead of using the UMLS API. It is also necessary to develop an approximate match technique for cases where a direct lookup of the user's query finds nothing in the tables of the UMLS. Hopefully, the local copy of the UMLS will be useful for future text-mining research.
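We did not get as far as choosing an approximate match technique. Purely as one possible starting point, the sketch below normalizes the query and picks the candidate name with the smallest edit (Levenshtein) distance. The class name, method names, and the distance threshold are all assumptions for illustration, and other techniques (stemming, word-order normalization, the SPECIALIST Lexicon tools) might suit the UMLS better.

```java
import java.util.List;
import java.util.Locale;

// Hypothetical sketch of a fallback for queries with no exact match.
class ApproximateMatchSketch {

    // Lowercase the text and collapse punctuation and whitespace runs.
    static String normalize(String s) {
        return s.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9]+", " ").trim();
    }

    // Classic Levenshtein distance, computed with two rolling rows.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Returns the candidate closest to the query, or null when no candidate
    // is within maxDistance edits of it.
    static String closest(String query, List<String> candidates, int maxDistance) {
        String q = normalize(query);
        String best = null;
        int bestDistance = maxDistance + 1;
        for (String candidate : candidates) {
            int d = editDistance(q, normalize(candidate));
            if (d < bestDistance) {
                best = candidate;
                bestDistance = d;
            }
        }
        return best;
    }
}
```

Comparing the query against every concept name in the Metathesaurus this way would be slow, so in practice the candidate list would first have to be narrowed, for example with a prefix query against the local database.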
This opportunity provided by the Distributed Mentor Project allowed me to get a glimpse of the research process through hands-on work and through interactions with professors and grad students. The fields of medical informatics and information access were new to me. I refreshed my UNIX skills and learned about MySQL, JDBC, Java, and HTML while attempting to figure out the design of the UMLS and how it fits with DynaCat. I enjoyed applying some of the skills I've learned during classes to a practical project.
A few thoughts on grad school: Success seems to depend on the ability to determine what steps will enable you to accomplish your goals, and having the self-motivation to take those steps. Also, the grad students rely on the ability to express clearly what they are working on and why their work matters.
Many thanks to Wanda Pratt, Cathy Blake, and Christine Groce for their help.