On the first day Christine and I met Wanda in her office and got an overview of the Dynacat project. Medical informatics is a field that I knew nothing about previously, so there is a lot to learn. Then we went next door to meet Cathy and Yun, Ph.D. students working in Wanda's research group. Cathy told us more about how the Unified Medical Language System (UMLS) maintained by the National Library of Medicine (NLM) is used by Dynacat. Currently Dynacat uses the UMLS API to retrieve information over the web. The first thing we are going to be working on is writing software to download the large UMLS knowledge base and put it into a local MySQL database server. We ate lunch with Wanda, Cathy, Yun and another Ph.D. student, Madhu. We spent the afternoon getting our UNIX accounts set up with the help of support.
Tuesday we met with Cathy who went over specific steps for making the UMLS into a MySQL database. She showed us the shared account medinfo. We need to unpack the tar'd file UMLS2001. We need to look for the script that loads the UMLS into a MySQL database. The final step will be updating Dynacat to read from the local MySQL database, instead of using the UMLS API.
Group lunch on Monday: Wanda, Yun, Jack, Henry, Cathy, Christine and me. We got an update on what everyone is working on and talked about getting ready to put the UMLS into a local MySQL database.
We gathered information on what size MySQL can handle and what size the UMLS is to see if MySQL is up for the job. Tuesday we met with Cathy in the afternoon. She gave us a copy of her Details.doc, a Dynacat Interaction table explaining more clearly how the current implementation works.
We presented to Cathy what we had found in terms of different ways to gain access to a database remotely and she told us about Microsoft VisualStudio.net. Basically we came to the conclusion to go ahead with the plan of putting the UMLS into a MySQL database, and will probably use JDBC.
From the UMLSinfo page I found a load script for MySQL, exactly what we had been looking for last week. It loads 32 Metathesaurus files into 32 MySQL tables. We need to understand how these are written and write load scripts for the remaining UMLS files.
We have downloaded the most recent stable version of MySQL: 3.23.39. Soon we will install it. I am writing a script of UNIX commands to make dummy files (containing the first 20 lines of the actual UMLS files) in order to test the load scripts.
We will be working on coming up with test strategies for both the dummy db and the real db. It will involve checking the number of rows and columns, and pulling out some data and comparing it with the original.
Group lunch on Wednesday was with Wanda, Cathy and Christine. We met with Cathy on Thursday to report on how the dummy db was going. We had lots of questions about MySQL and databases in general. We printed the MySQL manual on the heavy duty printer. MySQL is installed, but setting up everything is definitely challenging.
We looked for information about having the webserver (Tomcat) and the db (MySQL) on the different machines. I had not heard of Tomcat before, so here's some info:
What is Tomcat? (from the Jakarta FAQ)
Tomcat is the servelet container that is part of the JavaServer Web Development Kit (JSWDK) available from Sun. The source code for Tomcat is being provided to the Jakarta project for further development through the Open Source Model.
We discussed with Cathy how were going to figure out the column types for the Lexicon and Semantic Network tables of the UMLS. The problem is that some columns need to be varchar (N) where N is the maximum number of characters contained in any field in that column. It is not something to be done manually because many of the tables in the UMLS contain thousands of records. We talked aboutthe advantage of having the column type as exact as possible and how many columns are there that we do not have a column type for.
The plan is to create the tables with columns of unknown types as varchar (255) and write a MySQL script which finds the number of characters in the longest field in a column. One line for each column of unknown type can do the job:
SELECT MAX(LENGTH (column_name)) FROM table_name;
Also, we will check the error logs after the tables are created to see if any field are truncated at 255 characters, indicating that the column type needs to be text, not varchar, because varchar has a limit of 255 characters.
I spent some quality time with the MySQL manual, especially Chapter 6, the MySQL Access Privilege System. Played around with the commands for starting the server, shutting it down, creating users, granting and revoking privileges, etc. Then using loadscripts I created UMLS tables, ran other scripts to determine the correct column types, changed the loadscripts and re-created the tables.
The file system where the UMLS database is stored became full before I had all the tables indexed how I wanted them, so I deleted unneeded files and moved other files to open up some space. I also used a MySQL tool called myisampack to compress one of the tables. The result of myisampack is a read-only table (just fine for our purposes) which is 40-70% smaller than the original. In the case of the Metathesaurus table which I compressed, it reported that it was 63.72% smaller (before: 87,731,296 bytes and after: 31,825,677 bytes). Supposedly the myisampack compression does not create much access overhead, but I chose a table I don't think will be used much (if at all) by the current implementation of Dynacat, just in case.
All 40 UMLS tables are in MySQL with column indices on what appears to be the important columns. We created 8 of the UMLS tables to have primary keys. We can reuse the loadscripts to recreate the tables with new primary keys if we realize it is necessary when we see how the tables are being accessed by DynaCat.
Christine was able to connect to the MySQL database remotely using JDBC. woohoo! We met with Cathy to go over the next step: updating the DynaCat code to read from the local db instead of use the UMLS API. We have started looking at the actual java servlet code of DynaCat and we are going to read Chapter 3 of Wanda's Ph.D. dissertation in order to get an idea of how information from the UMLS is used by DynaCat. Next, I will read more about the UMLS API, specifically for the Metathesaurus. Some of the commands might not directly translate to a select statement on one of the already existing tables, so we might need to figure out what tables should be joined to have the needed information available to DynaCat.
Thursday we went to the final defense of the dissertation of Ph.D. candidate in the AI section of ICS here at UCI. It was titled "Time Series Similarity Searching". He was a good speaker, because I was engaged and even semi-following the presentation even though I walked in there not even knowing what a time series is. I was surprised by how mathematical his research is. We waited in the hallway while the committee met. Then they came out into the hallway saying "Congratulations, Dr. Keogh!". What a rush. It was exciting even though I don't know him.
We helped Cathy download medical journal articles for her meta-analysis of studies of breast cancer that mention smoking and alcohol. Basically I understand meta-analysis to be a study of studies: a set of statistical tools to enable a standardized way of analyzing results from different studies to determine if there is a statistically significant result.
I worked on making a MySQL tip sheet, so that Cathy or Christine would have a quick reference when dealing with the database, instead of the massive online documentation or manual. The tip sheet covers bringing up, shutting down, and connecting to the server, creating a new user, using the logs, compressed tables, batch mode and basic commands (create table, load data, create index, alter table, show, describe and drop).
I went over the UMLS documentation and the files in the UMLS download to review which files have been put into MySQL and make a list of tables which might be unnecessary.
I spent some time reading the Emacs documentation to figure out a couple settings.
To set the printer-name variable, go to the *scratch* buffer and type:
(setq printer-name daisy)
then Control-j.
then to display the value of the variable, type Control-h v printer-name RETURN
To get colors to show up when editing code:
Esc-x customize RETURN --> Faces group --> Font Lock group --> Global Font Lock Mode (toggle to ON).
We also worked on editing the JDBC connection so that MySQL can be accessed from galt or other machines, not just ender. The syntax is:
Connection Con = DriverManager.getConnection ("jdbc:mysql://[hostname][:port]/dbname[?param1=value1][¶m2=value2]...");
I made a UMLS and MySQL stats page that has the file size info about the UMLS. This is the type of information that Christine and I were looking for during the first week.
We worked on translating the UMLS API calls in the Dynacat servlet to calls to the UMLS db. Steps for doing this are identifying the tables needed from the UMLS db and writing Java methods to implement the following:
To do the approximate match for getCUI and getConceptName, we need to use something similar to the lexical program lvg described on the UMLS information site. I looked at the lexical programs, especially norm (actually a specific flow of lvg), a program which normalizes words or strings to terms that you can find in the tables. For example, "Lung Cancer" becomes "cancer lung". norm removes stop words, removes genitives, replaces punctuation with spaces, puts the words in lowercase, uninflects each word (this can be done with table LRAGR) and does a word order sort by ascending ASCII order.
The JDBC remote connection to the database wasn't working because of the access permissions in MySQL. Now that the user has access from any host (using the wildcard '%'), the connection URL works: "jdbc:mysql://ender.ics.uci.edu/db_name?user=user_name&password=password". woohoo!
Looking at the Lexical Tools of the UMLS, there are several versions available. There is LV1_84H.TAR.GZ in the download, LV1_84H5.TAR.GZ on the UMLS Lex tools webpage is the "Updated Official 2001" 4/9/01 version, and lv1_84h6.tar.gz also on the webpage is the "2001 with Binaries for Solaris 2.5" 6/25/01 version. We need to look at the code for lvg in order to write code to do approximate match.
I focused on writing the queries for GetPosition which returns the position of a term in the MeSH hierarchy. I read through the UMLS documentation to find which table and columns I need to use. Appendix B.1.1: Column Descriptions indicates that HCD of MRCXT is the answer. HCD is the "hierarchial number or code of context member. alphanumeric value." which can contain a MeSH tree number ("the hierarchial number for the concept in the MeSH tree structures").
For example, 'breast cancer' maps to the MeSH term 'Breast Neoplasms' which has two positions in the MeSH hierarchy, represented by two tree numbers:
C04.588.180 and C19.146.170.
These tree numbers represent the following hierarchies:
C | Diseases |
C04 | Neoplasms |
C04.588 | Neoplasms by Site |
C04.588.180 | Breast Neoplasms |
C | Diseases |
C19 | Endocrine Diseases |
C19.146 | Breast Diseases |
C19.146.170 | Breast Neoplasms |
Here are some more examples of MeSH terms and their positions in the hierarchy.
I worked on finishing the GetPosition and GetMeSH methods. It was more difficult than I expected because there are some concepts whose CUIs are not the same as the CUI of the term selected to be the MeSH term name. For example, "breast cancer" maps to the CUI "C0006142", but the MeSH term for breast cancer is "Breast Neoplasms" which maps to the CUI "C0006149". So to get from "breast cancer" to "Breast Neoplasms" (or the MeSH tree numbers) I would need to somehow get from "C0006142" to "C0006149". This relationship is captured in a Metathesaurus table named MRREL with an entry like this:
C0006142|RB|C0006149| |MSH2001|MSH2001
"C0006142" stands for the concept "Malignant neoplasm of breast"
"RB" stands for "has a broader relationship"
"C0006149" stands for the concept "Breast Neoplasms"
"MSH2001" is listed as the source of the relationship and the source of the relationship label.
We met with Cathy on Tuesday and she explained a good way to divide up the code is to have one version that does direct match and another version that does approximate match, so that you know which kind of match you are getting back.
This week I worked on getting the direct match code into final form. When there is no direct match, the code will return an empty string. I went to my last group lunch on Tuesday...my last Hong Kong chow fun in Irvine!
I also wrote my final report which is here.