Final Report
Background

Nowadays, many people use internet search engines look for answers. However, search engines always return a lot of results. It can be confusing and challenging for users to decide which result is the one they really need. My mentor, Wanda Pratt, developed a new system called Dynamic Categorization (DynaCat). DynaCat automatically organizes medical search results into meaningful groups corresponding to users' queries. In order to do such a thing, we must have a database so that we can retrieve data from or for the user. Unified Medical Language System (UMLS) is a long-term National Library of Medicine (NLM) research and development effort designed to facilitate the retrieval and integration of information from multiple machine-readable biomedical information sources. Thus DynaCat uses UMLS as a database. My role in this project was to download UMLS and setup the database, make a connection to it so that we could manipulate the data from UMLS remotely.

Project description

Unix
My first challenge was learning how to use the Unix box and learn all the basic commands because all of my work would be done on Unix. As a person who has done most of her work on PC and didn't know anything about DOS, I thought I was in big trouble. It was frustrating at the beginning; I had to work on Unix while I was still learning it. However, everything takes practice. After I stared using Unix all the time, one day I realized that I had learned all the common commands by heart.

UMLS
Before we started to do any actual work, like downloading UMLS, etc. we had to read a lot of documents online.

My first task was to find out how big UMLS is. The full Metathesaurus occupies 3.00 GB and a minimum of 8 GB free disk space is required to use MetamorphoSys. Then we tried to decide what database software we would use to hold UMLS. Cathy, the grad student we've been working with, suggested MySQL, but she was not sure whether MySQL could handle such a big database. So we looked for more documents and found out MySQL was able to do the job. Then we went ahead and downloaded UMLS. During the unzip process, I learned how to unzip and untar on Unix.

After we downloaded UMLS, we started to write load script. There are three types of files in UMLS: Metathesaurus, Semantic, and Lexical. Luckily, we found load script for 16 Metathesaurus files online. That helped a lot. It gave us a much better idea how to write the rest of the Metathesaurus, Semantic, and Lexical files. It would have been pretty difficult to write those files by ourselves especially since we were not very familiar with the data in UMLS. This was because we didn't know the type, size, and primary key situation of each column. The files are just too huge to look at each row of them and try to figure out all that information manually. Luckily, after a meeting with Cathy, we decided to set all columns we were not sure to size 255. Later we used MySQL to check the size of each column.

JDBC
While Alice was working on setting up UMLS on MySQL, I was working on Java Database Connectivity (JDBC). I learned in order to use JDBC, we needed to have Java, JDBC API, Java Development Kit (JDK), and JDBC driver. Since Java was already set up on the Unix, and Java Servlet Development Kit (JSDK) which contains JDK was set up on the Unix as well, the last thing I needed to do was download the JDBC driver. Since we were using MySQL as our database server, we needed some specific drivers that would work with MySQL. I found two drivers that both would work with MySQL from MySQL official homepage, so I thought both of them should be reliable since MySQL recommended them. After I read both webpages, I decided to go with mm.mysql driver. There were two steps to installing MM.MySQL. First, I needed to unpack the files I downloaded which was in *.jar format, and second I needed to add MM.MySQL to the CLASSPATH. After JDBC was successfully installed, the next step was to establish a connection. Here is the code. In this code, this program will call the driver and make the connection to the UMLS in MySQL, then insert a row into an existing file, then display a selected column. This code can retrieve data from UMLS which is on ender from galt; i.e., it can access the database remotely which is a very important goal of this project.

Java Program
After we were sure that the JDBC driver worked, we started to write java programs. One was Driver.java which is a hard-coded program. The other one was UMLSDB.java which opens the connection to the database once and then accepts a string as a parameter to several methods. The methods included: GetCUI, GetConceptName, GetSemanticType, GetPosition, and GetMeSH.

GetCUI
Input: phrase
Output: CUI of the concept (direct match of input)
Example: "breast cancer" --> "C0006142"
Query: SELECT MRCON.CUI FROM MRCON WHERE MRCON.STR=s1;

GetConceptName
Input: phrase or CUI
Output: concept name (direct or approximate match of input)
Example: "breast cancer" --> "Malignant neoplasm of breast"
"C0006142" --> "Malignant neoplasm of breast"
Query:
Input is a CUI:SELECT MRCON.STR FROM MRCON WHERE MRCON.CUI=s2 AND MRCON.LAT='ENG' LIMIT 1;
Input is not a CUI, so call GetCUI first
CUI = GetCUI(s2);
SELECT MRCON.STR FROM MRCON WHERE MRCON.CUI=CUI AND MRCON.LAT='ENG' LIMIT 1;

GetSemanticType
Input: phrase or CUI
Output: Semantic Types of the concept represented by the input, separated by "|"
Example: "breast cancer" --> "Neoplastic Process"
"C0006142" --> "Neoplastic Process
Query:
Input is a CUI:SELECT MRSTY.STY FROM MRSTY WHERE MRSTY.CUI=s3;
Input is not a CUI, so call GetCUI first
CUI = GetCUI(s3);
SELECT MRSTY.STY FROM MRSTY WHERE MRSTY.CUI=CUI;

GetPosition
Input: CUI or text of a MeSH term
Output: MeSH tree numbers separated by |
Example: "C0006142" --> "C4.588.180|C19.146.170"
"Breast Neoplasms" --> "C4.588.180|C19.146.170"
Query:
Input is a CUI:SELECT MRCXT.HCD FROM MRCXT WHERE MRCXT.CUI=s4 AND MRCXT.SAB='MSH2001' AND MRCXT.CXL='CCP';
Input is not a CUI and input is a MeSH term, do lookup by CXS.
SELECT MRCXT.HCD FROM MRCXT WHERE MRCXT.CXS=s4 AND MRCXT.SAB='MSH2001' AND MRCXT.CXL='CCP';
Input is not a MeSH term
CUI = GetCUI(s4);
SELECT MRCXT.HCD FROM MRCXT WHERE MRCXT.CUI=CUI AND MRCXT.SAB='MSH2001' AND MRCXT.CXL='CCP';

GetMeSH
Input: phrase or CUI
Output: MeSH term
Example: "breast cancer" --> "Breast Neoplasms"
"C0006142" --> "Breast Neoplasms"
Input is a CUI:SELECT MRCXT.CXS FROM MRCXT WHERE MRCXT.CUI=s5 AND MRCXT.SAB='MSH2001' AND MRCXT.CXL='CCP' LIMIT 1;
Input is not a CUI, so call GetCUI first
CUI = GetCUI(s5);
SELECT MRCXT.CXS FROM MRCXT WHERE MRCXT.CUI=CUI AND MRCXT.SAB='MSH2001' AND MRCXT.CXL='CCP' LIMIT 1;

Please look at the complete code of Driver, UMLSDB, and the sample output.

What's Next

The next step is to work on the aproximate match in the Java program. For an example, if the input is "increased body temperatrure", it should return concept "fever". That involves using Lexical Variant Generation (LVG). I started to read documents about LVG, but I didn't have time to finish it.

Conclusion

This was a totally new experience for me. Most of my learning has been listening to professors and understanding what they've teach me, but in this project I taught myself by looking for helpful material online. I suppose this is the whole point of a "research" project. I do the "research" myself. And I think my skills of finding and differentiating online sources have improved.

One important thing I got out of this research project was that I learned Unix. For a long time, I worried that I only knew one OS, PC, or Mac (if you count that). As a CS student, that was really too little. However, after this DMP, now I can add Unix the the list.

I've taken one Database System class and that was my weakest class. I didn't do well in that class at all. I am glad I had this chance to do a real-life Database system task. It also gave me a chance to refresh my memory of SQL. After I go back to school, I might get a job from a company to set up their Database. The DMP experience definitely will help me on that job.

I've taken many programming classes, but I had never gotten a chance to use them in my previous jobs or internships. I am glad I used Java for this research project. I actually used something I learned from school!

The DMP also gave me a little taste of grad school. After interacting with many grad students and going to a final defense, I have a better idea of what grad school is like and what the grad school progress is like. I think BS is not enough for me; I'll definitely go to grad school after I finish college.



Home
Weekly Journal