[written in cooperation with Deborah Lamb]
Motivation
We are working with Professor Janice Cuny on one of her ongoing research projects, VRV-ET (Virtual Research Vessel Educational Tool). It is called a virtual research vessel because it compiles up-to-date research from several geological projects. It is called an educational tool because it gives students the chance to be actively involved in the scientific process of geological research: students can examine the data, pictures, and other documentation from currently active geological sites, analyze the data themselves, and draw their own conclusions while following the progress of the on-site team of geologists. Currently the system is used by geologists to create web-based presentations to be viewed by students.
Both Professor Cuny and Megan Chinburg, an undergraduate student at the University of Oregon, have identified areas of VRV-ET that need improvement. Professor Cuny suggested better protection of users' data: access should be controlled so that one user's work cannot be edited or deleted by another. This would involve a permission system in which only users with the correct permission could edit the data, while other users could only read it. She also suggested an easier way to retrieve data from the system than manually looking through all of the projects stored on VRV-ET, since the number of projects could grow quite large in the future. Megan felt that VRV-ET's current interfaces were poorly designed and very hard for users to navigate. She had a large number of suggestions for small changes to fix these problems and make the system easier for first-time users to learn.
We chose to work on a data retrieval tool to improve the functionality of VRV-ET. Our tool will allow users to search through and access the data stored in VRV-ET to retrieve images fitting their search criteria. We chose this project because it seemed like the most fun to implement and it added the most immediately useful functionality to the system.
There has been a great deal of work on data retrieval systems and text-based retrieval systems, since they are widely used in real-world applications. Google is a good example of a text-based image retrieval and indexing system.
Google employs a software robot called Googlebot to collect Web pages, then builds an internal representation of them: for each possible query word, Google stores all the documents that contain it. When a user enters a query, Google finds all the documents that contain every query word.
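The data structure behind this kind of word-to-documents lookup is an inverted index. Here is a minimal stdlib sketch of the idea (an illustration, not Google's or Lucene's actual code): each word maps to the set of document ids containing it, and a multi-word query intersects those sets.

```java
import java.util.*;

// Minimal inverted index: for each word, the set of document ids that
// contain it. A multi-word query keeps only documents containing ALL words.
public class InvertedIndex {
    private final Map<String, Set<Integer>> index = new HashMap<>();

    // Record every word of a document under the document's id.
    public void addDocument(int docId, String text) {
        for (String word : text.toLowerCase().split("\\s+")) {
            index.computeIfAbsent(word, w -> new TreeSet<>()).add(docId);
        }
    }

    // Return the ids of documents containing all words of the query.
    public Set<Integer> search(String query) {
        Set<Integer> result = null;
        for (String word : query.toLowerCase().split("\\s+")) {
            Set<Integer> docs = index.getOrDefault(word, Collections.emptySet());
            if (result == null) result = new TreeSet<>(docs);
            else result.retainAll(docs);   // set intersection: AND semantics
        }
        return result == null ? Collections.emptySet() : result;
    }

    public static void main(String[] args) {
        InvertedIndex idx = new InvertedIndex();
        idx.addDocument(1, "basalt sample from ridge");
        idx.addDocument(2, "basalt photo");
        System.out.println(idx.search("basalt sample")); // prints [1]
    }
}
```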
Before Google, search engines often looked only at the occurrences of a query word to determine the importance of a page. Google not only counts the occurrences of the query words in a document but also analyzes the distribution of the keywords, and it relies on PageRank(TM) to determine the importance of a web page. PageRank makes use of the democratic structure of web links: a page that receives more links from other pages is considered more important than one with fewer incoming links. However, PageRank goes beyond the sheer volume of links a page receives; it also analyzes the quality of the linking pages. Pages that are high quality themselves help make the pages they link to more important.
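The link-voting idea can be illustrated with a small power-iteration sketch. This follows the published PageRank recipe in outline only, not Google's implementation; the damping factor 0.85 is the commonly cited default, and the tiny link graph is invented for the example.

```java
import java.util.*;

// Toy PageRank: each page distributes its rank evenly among the pages it
// links to, and ranks are updated repeatedly until they settle. A page
// linked to by important pages ends up important itself.
public class PageRankSketch {
    // links[i] lists the pages that page i links to (every page must link somewhere).
    public static double[] pageRank(int[][] links, int iterations, double damping) {
        int n = links.length;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n);                    // start with equal ranks
        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            Arrays.fill(next, (1 - damping) / n);      // baseline "random surfer" share
            for (int i = 0; i < n; i++) {
                for (int target : links[i]) {
                    next[target] += damping * rank[i] / links[i].length;
                }
            }
            rank = next;
        }
        return rank;
    }

    public static void main(String[] args) {
        // Pages 0 and 1 both link to page 2; page 2 links back to page 0.
        int[][] links = { {2}, {2}, {0} };
        double[] r = pageRank(links, 50, 0.85);
        System.out.printf("ranks: %.3f %.3f %.3f%n", r[0], r[1], r[2]);
    }
}
```

In the example graph, page 2 ranks highest (two incoming links), and page 0 ranks second despite having only one incoming link, because that link comes from the high-quality page 2.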
Google also provided some inspiration for our interfaces, since it has a simple design that most people understand. Megan Chinburg explained the importance of easy-to-use interfaces in her analysis of VRV-ET: a user should be able to achieve their goals quickly and with as little frustration as possible.
Google is not the only search engine we looked at for reference. GeoGuide is a search engine developed at a university in Germany. It was exceptionally helpful because its developers described some of the challenges they faced. One thing they mentioned was that the advanced search was almost never used, and neither was the advanced version of the form filled out when someone submitted a site to be indexed. We took this into consideration when designing our user interfaces.
While we used Google as a reference, simply embedding it in our system was not a good idea, because we want to search for specific information about the pictures, such as author, location, and a range search for photographs taken between two dates, among many other things. Google would only allow us to search by file name and by the text located around a picture in a web page, which may or may not have anything to do with the picture itself. We would also someday like to extend the system to do more than search (discussed under future work), which would not be possible without a tailored search.
Most of our project is implemented in Java. We chose Java for its simple APIs and platform independence. We use Digester to parse XML data and Lucene to handle indexing and searching. Digester is open-source software from Apache that uses a SAX parser to turn XML documents into Java objects. Like Digester, Lucene is open-source software from Apache; it is an efficient, full-featured text search engine written in Java.
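Digester's rule-based API maps XML elements onto Java objects. As a rough illustration of the SAX parsing it builds on, here is a sketch using only the JDK's bundled SAX parser; the <image>, <author>, and <location> element names are invented for the example and are not VRV-ET's actual schema.

```java
import java.io.StringReader;
import java.util.*;
import javax.xml.parsers.*;
import org.xml.sax.*;
import org.xml.sax.helpers.DefaultHandler;

// Sketch of what Digester does under the hood: walk an XML document with a
// SAX parser and collect each element's text into a Java object (here, a map).
public class SaxSketch {
    public static Map<String, String> parseImage(String xml) {
        final Map<String, String> fields = new LinkedHashMap<>();
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new InputSource(new StringReader(xml)), new DefaultHandler() {
                private final StringBuilder text = new StringBuilder();
                public void startElement(String uri, String local, String qName, Attributes a) {
                    text.setLength(0);            // start collecting fresh text
                }
                public void characters(char[] ch, int start, int len) {
                    text.append(ch, start, len);  // accumulate the element body
                }
                public void endElement(String uri, String local, String qName) {
                    if (!qName.equals("image")) fields.put(qName, text.toString().trim());
                }
            });
        } catch (Exception e) {
            throw new RuntimeException("bad XML", e);
        }
        return fields;
    }

    public static void main(String[] args) {
        String xml = "<image><author>J. Cuny</author><location>ridge</location></image>";
        System.out.println(parseImage(xml)); // prints {author=J. Cuny, location=ridge}
    }
}
```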
The project has two main parts: indexing the data on VRV-ET and searching through that index. Here is a list of tasks the project required us to complete:
Parse XML using Digester
Index Digester results
Decide on a file structure
Find a way to rank the search
Output the index results using chosen file structure
Search by author, project, and any other field
Search from day x to day y
Add interfaces and other helpful features
Connect to VRV-ET
While thinking about an appropriate file structure, we were informed that the storage of the data would change over time. We felt it would be better if our program could easily handle these changes, so we designed the system to accept certain parameters rather than depend on the structure of the file system, with the idea that we would later add a Data Access Object design pattern. As a result of this decision, our program is written very generically so that these things can be changed easily.
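The Data Access Object idea can be sketched as follows. This is a hypothetical illustration, not VRV-ET's actual code: the indexer depends only on an interface, so a file-system or database implementation could be swapped in without touching the indexer.

```java
import java.util.*;

// Hypothetical DAO interface: the indexer talks to this and never learns
// whether records come from a file system, a database, or anything else.
interface ImageRecordDao {
    List<Map<String, String>> findAllRecords();
}

// One interchangeable implementation: records held in memory.
class InMemoryImageRecordDao implements ImageRecordDao {
    private final List<Map<String, String>> records;
    InMemoryImageRecordDao(List<Map<String, String>> records) { this.records = records; }
    public List<Map<String, String>> findAllRecords() { return records; }
}

public class DaoSketch {
    // The "indexer" side depends only on the interface, never a concrete store.
    static int countRecords(ImageRecordDao dao) {
        return dao.findAllRecords().size();
    }

    public static void main(String[] args) {
        Map<String, String> rec = new HashMap<>();
        rec.put("author", "M. Chinburg");
        ImageRecordDao dao = new InMemoryImageRecordDao(Collections.singletonList(rec));
        System.out.println(countRecords(dao)); // prints 1
    }
}
```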
We have successfully indexed the projects on VRV-ET, building a full-text searchable index from XML files. Our system can index file content, specific fields in a file, and the metadata stored for images. It uses Digester to convert XML documents into Java objects; Digester lets us parse specific elements of an XML document as well as the whole file, a flexibility that fits our system well. These Java objects are then used to create Lucene documents, which are indexed with an index writer.
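As a rough plain-Java analog of this pipeline (not our actual Lucene code), each parsed object can be treated as a document of named fields, with every field value indexed separately so that a later search can target a single field:

```java
import java.util.*;

// Per-field inverted index: field name -> (term -> document ids). This mimics
// the shape of a Lucene Document with multiple named fields, in plain Java.
public class FieldIndex {
    private final Map<String, Map<String, Set<Integer>>> index = new HashMap<>();

    // Index every field of a parsed record under the record's id.
    public void addDocument(int docId, Map<String, String> fields) {
        for (Map.Entry<String, String> e : fields.entrySet()) {
            Map<String, Set<Integer>> terms =
                index.computeIfAbsent(e.getKey(), k -> new HashMap<>());
            for (String term : e.getValue().toLowerCase().split("\\s+")) {
                terms.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Find documents whose given field contains the given term.
    public Set<Integer> search(String field, String term) {
        return index.getOrDefault(field, Collections.emptyMap())
                    .getOrDefault(term.toLowerCase(), Collections.emptySet());
    }

    public static void main(String[] args) {
        FieldIndex idx = new FieldIndex();
        Map<String, String> doc = new HashMap<>();
        doc.put("author", "Janice Cuny");
        doc.put("location", "East Pacific Rise");
        idx.addDocument(1, doc);
        System.out.println(idx.search("author", "Cuny"));   // prints [1]
        System.out.println(idx.search("location", "Cuny")); // prints []
    }
}
```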
We have also implemented a keyword search of the indexed files, and we have the capability to perform an advanced search once the geologists have entered information about the pictures they submit. The advanced search can target specific fields within those files, such as author and location, and can perform a range search for dates. Thanks to our design strategy, we can search any field in any way simply by writing a new HTML form and one short Java class that implements a generic interface.
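The date range search can be sketched with a sorted map, a simplified stand-in for the index's range query; the dates below are invented for the example.

```java
import java.time.LocalDate;
import java.util.*;

// Range search sketch: a sorted map from capture date to image ids turns a
// "between two dates" query into a subMap over the sorted keys.
public class DateRangeSearch {
    private final TreeMap<LocalDate, Set<Integer>> byDate = new TreeMap<>();

    public void add(int imageId, LocalDate taken) {
        byDate.computeIfAbsent(taken, d -> new TreeSet<>()).add(imageId);
    }

    // All image ids taken between from and to, inclusive on both ends.
    public Set<Integer> between(LocalDate from, LocalDate to) {
        Set<Integer> result = new TreeSet<>();
        for (Set<Integer> ids : byDate.subMap(from, true, to, true).values()) {
            result.addAll(ids);
        }
        return result;
    }

    public static void main(String[] args) {
        DateRangeSearch s = new DateRangeSearch();
        s.add(1, LocalDate.of(2003, 5, 1));
        s.add(2, LocalDate.of(2003, 6, 15));
        s.add(3, LocalDate.of(2003, 8, 2));
        // Only images 1 and 2 were taken between May 1 and July 1.
        System.out.println(s.between(LocalDate.of(2003, 5, 1), LocalDate.of(2003, 7, 1)));
    }
}
```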
We did a lot of reading about ranking searches and discovered that many of the best web search engines earned their reputations by ranking results in ways that webmasters have a harder time gaming; many businesses want their sites to rank higher in searches so they can attract more visitors. Since we only search a site created by geologists for educational purposes, we do not have to worry about such manipulation. Another standard and effective way to rank results uses location and frequency. By default, Lucene ranks its results by the frequency of the query terms within a field, and it ranks a result higher if the terms occur within a shorter field. We felt this was a very good ranking scheme for our purposes, so we did not change the default.
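A toy version of that frequency-and-length idea keeps just the two factors mentioned: term frequency is rewarded, and matches in shorter fields score higher. Lucene's real scoring formula has more factors; this is a deliberate simplification, not its actual code.

```java
import java.util.*;

// Simplified ranking: a match scores higher the more often the term occurs,
// and higher still when the field containing it is short.
public class RankSketch {
    public static double score(String term, String fieldText) {
        String[] words = fieldText.toLowerCase().split("\\s+");
        int freq = 0;
        for (String w : words) {
            if (w.equals(term.toLowerCase())) freq++;
        }
        // Frequency is rewarded but damped; long fields are normalized down.
        return Math.sqrt(freq) / Math.sqrt(words.length);
    }

    public static void main(String[] args) {
        double shortField = score("basalt", "basalt ridge");
        double longField  = score("basalt", "a photo of some basalt near the ridge camp");
        System.out.println(shortField > longField); // prints true
    }
}
```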
We designed our interfaces to be very easy for a user to understand. Since the system currently only supports keyword search, it simply asks the user to check off the items they wish to search for and then click a button to submit the search.
The results show up as a table of thumbnails below the search form.
A user can then click on the thumbnail to view the actual file.
We felt that these interfaces were rather intuitive, however we were not able to test this theory due to time constraints.
Throughout the process of completing these tasks we have learned a great deal. We now know how to use Tomcat, write servlets and JSP pages, and apply several design patterns, among other useful skills. We have also learned a lot about XML: how to use it to write build files and other files needed by Tomcat and Ant, and how to use it to organize data. We also now know how search engines work, including how they typically index web sites and the criteria they use to search through that index.
If we had had more time to work on this research project, we would have further changed the implementation of the indexer and searcher so that they used the Data Access Object design pattern. This change would allow the indexer and searcher to retrieve information from any kind of database or file system without knowing where the information comes from. Currently our program requires the location of the files and the rules to be passed in as initialization parameters.
Since much of the information our searcher can find, such as the author, project, and date a picture was taken, is not currently stored in the repository, a system would need to be created to add that information. A new program would also need to be written to give the geologists an interface for adding this information to the metadata stored for a picture when they add one to their project.
In the future, our indexer and searcher may be used to search the system for specified information and create a presentation or slide show from the results. This would require our system to order the results so that the most similar ones are near each other; that work would take months to complete and is a possible research topic for another student.
References
How Search Engines Rank Web Pages