Week 4: July 8 - 12
This week was also a short one, as my mentor is at the TISLR (Theoretical Issues in Sign Language Research) conference and I had lots of time to myself! It was a nice opportunity to dive further into the search engine and refine my optimizations to its internals.

I finally set up a reliable benchmark on the Windows machine after getting very divergent results from my informal testing. The key is to disable EVERYTHING, including the antivirus and network-related processes, in order to get a consistent set of results!

The benchmarks led me to realize that one of my optimizations only worked well on large datasets and actually slowed down searches on small ones. It's always nice to have hard data to back up your assumptions instead of going on theory alone! Another optimization worked only for case-sensitive searches and broke for case-insensitive ones, so it was back to the drawing board there.

I also found that a faster XML parser gave us only a small benefit because the EAF files are so small. If they were 100 MB or larger, the faster parser would definitely help, but the average EAF file isn't even 1 MB. Furthermore, replacing the XML parser caused issues in other areas of ELAN that relied on deprecated Apache Xerces functions, and it was too much work to hunt them all down and change code in areas I didn't want to break.
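For anyone curious about the benchmark setup, the harness I ended up with looks roughly like the sketch below: warm-up runs first so the JIT compiles the hot paths, then a batch of timed runs whose median gets reported, since any single measurement on the Windows machine is hopelessly noisy. The `SearchEngine` class here is a made-up stand-in for ELAN's actual search code, not its real API:

```java
import java.util.Arrays;

public class SearchBenchmark {

    public static void main(String[] args) {
        SearchEngine engine = new SearchEngine(); // hypothetical stand-in for ELAN's search code

        // Warm-up: let the JIT compile the hot paths before measuring anything.
        for (int i = 0; i < 10; i++) {
            engine.search("example query");
        }

        // Timed runs: collect several samples and report the median,
        // which is far more stable than any single run.
        long[] samples = new long[20];
        for (int i = 0; i < samples.length; i++) {
            long start = System.nanoTime();
            engine.search("example query");
            samples[i] = System.nanoTime() - start;
        }
        Arrays.sort(samples);
        System.out.printf("median: %.1f ms%n", samples[samples.length / 2] / 1e6);
    }
}

// Minimal stub so the sketch compiles; the real engine lives inside ELAN.
class SearchEngine {
    java.util.List<String> search(String query) {
        return java.util.Collections.emptyList();
    }
}
```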
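The case-sensitivity failure is also easy to show in miniature. A fast path that compares characters for exact equality is only valid when the search is case-sensitive; the case-insensitive path needs per-character case folding, which Java's `String.regionMatches` provides via its `ignoreCase` flag. This is my own toy illustration, not the actual ELAN code:

```java
// Sketch of the pitfall: a raw exact-equality scan is a valid fast path
// only when the search is case-sensitive.
static boolean contains(String haystack, String needle, boolean caseSensitive) {
    int max = haystack.length() - needle.length();
    for (int i = 0; i <= max; i++) {
        // regionMatches with ignoreCase=true does per-character case folding;
        // with ignoreCase=false it degenerates to the fast exact comparison.
        if (haystack.regionMatches(!caseSensitive, i, needle, 0, needle.length())) {
            return true;
        }
    }
    return false;
}
```

That case folding is exactly what makes the case-insensitive path slower, which is why the shortcut couldn't simply be reused.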
The week ended with me satisfied with the results of the search engine optimizations. I think I've done enough for now, as I've reduced the search time for a simple query from 17 seconds to 2 seconds! There are several more opportunities to improve the code further, but it was time to move on to other parts of the codebase to achieve what my mentor wanted.

Fortunately, while at the TISLR conference my mentor met Trevor Johnston, another researcher working on sign language corpora. He gave our team valuable feedback on the statistical features of ELAN and on his own work, as he recently published a paper on analysing a corpus for variation in signs. That motivated me to start work on the more "fun" areas of the codebase, since I'm majoring in mathematics and will be able to apply my core skills to more interesting applications.

I also had the opportunity to meet John Fallora, another student at DePaul who was part of the research team last year. His contribution was a Python script that generates a histogram of the lengths of a particular gloss in the corpus. I took my time looking through his code and thinking of different ways to integrate that functionality directly into ELAN.

The biggest hurdle right now is figuring out how to display the data in ELAN, as the current grid layout isn't that powerful: you can't sort columns, select which ones to display, or customize the view. In order to add MUCH more statistical information to the output, refining the display is the priority, since gathering the data is the easier task! That concludes this week's work, and I'm looking forward to next week as I dive into the display code in ELAN and start figuring out how to add data without overcrowding the view. To wrap up, I've jotted down two rough sketches below of the directions I'm exploring.
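First, the core of John's histogram idea. I'm reading "length" as the duration of each annotation; the types and method here are placeholders I made up, not ELAN's real annotation API, so treat this as a sketch of the computation rather than the integration:

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Rough sketch of the histogram computation behind John's Python script,
// using placeholder types rather than ELAN's real annotation classes.
class GlossHistogram {

    /** Buckets annotation durations (in ms) for one gloss into fixed-width bins. */
    static Map<Long, Integer> durationHistogram(List<long[]> annotations, long binWidthMs) {
        Map<Long, Integer> bins = new TreeMap<>();
        for (long[] span : annotations) {            // span = {startMs, endMs}
            long duration = span[1] - span[0];
            long bin = (duration / binWidthMs) * binWidthMs;
            bins.merge(bin, 1, Integer::sum);        // count annotations per bin
        }
        return bins;
    }
}
```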
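Second, the display. ELAN's UI is built on Swing, and Swing's `JTable` already supports sorting and column hiding once wired up, so the grid may not need to be rebuilt from scratch. The model, column names, and data below are invented purely for illustration:

```java
import javax.swing.JTable;
import javax.swing.table.DefaultTableModel;
import javax.swing.table.TableColumn;

// Minimal Swing sketch: a sortable grid with a hideable column.
class StatsGridSketch {

    static JTable buildTable() {
        DefaultTableModel model = new DefaultTableModel(
            new Object[][] {
                { "HOUSE", 42, 380 },
                { "SIGN",  17, 450 },
            },
            new Object[] { "Gloss", "Count", "Mean duration (ms)" });

        JTable table = new JTable(model);
        table.setAutoCreateRowSorter(true);   // click a header to sort by that column

        // "Hiding" a column: remove it from the view; the model still keeps the data.
        TableColumn countColumn = table.getColumnModel().getColumn(1);
        table.getColumnModel().removeColumn(countColumn);
        return table;
    }
}
```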