Week 1
I spent my first week learning about the project, which aims to predict changes in stock prices using information from financial news, and familiarizing myself with the tools I will be using. The articles I read gave background on how to extract the necessary semantic information from text and discussed methods for predicting changes in stock prices.
I also began learning about bash scripting and followed a step-by-step process (already written as a bash script) for preparing the financial news data (prior to evaluating the accuracy of the predictions). Each step in the script was commented out, so I went through it, uncommenting the step I wanted to run and re-commenting the previous one, until I had run every step of the process. After executing the whole process, I understood what was going on at each step and documented how the text is prepared at each of these stages.
Week 2
One of the steps in the data preparation is to locate company names using named entity recognition. The difficulty with this task is that a company's name has many variants, and each of them needs to be captured at this stage. I spent time locating instances of company names that weren't captured, in order to determine why they were missed and how best to include them without introducing errors. After noting where company names were missed, I concluded that the named entity recognition needed to happen at an earlier stage of the pipeline. I also noticed some general patterns in why these instances were missed, and began fixing some of these issues by adding to the Java code where the algorithms are written. Next week I will see whether the results improve (the script takes a while to run because of the large amount of data).
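To give a sense of the kind of change involved (the actual project code is different, so the class and method names below are hypothetical), a variant matcher might normalize punctuation and common corporate suffixes before comparing a mention against a list of known companies:

    import java.util.*;

    // Hypothetical sketch (not the project's actual code): map surface forms
    // such as "Apple Inc.", "Apple, Inc." and "Apple" to one normalized entry.
    public class CompanyNameMatcher {

        private final Set<String> canonicalNames = new HashSet<>();

        public CompanyNameMatcher(Collection<String> names) {
            for (String name : names) {
                canonicalNames.add(normalize(name));
            }
        }

        // Lowercase, strip punctuation, and drop common corporate suffixes.
        private static String normalize(String name) {
            return name.toLowerCase()
                       .replaceAll("[.,]", "")
                       .replaceAll("\\b(inc|corp|corporation|co|ltd|llc)\\b", "")
                       .trim()
                       .replaceAll("\\s+", " ");
        }

        // True if the mention maps to a known company after normalization.
        public boolean isCompanyMention(String mention) {
            return canonicalNames.contains(normalize(mention));
        }

        public static void main(String[] args) {
            CompanyNameMatcher m = new CompanyNameMatcher(Arrays.asList("Apple Inc."));
            System.out.println(m.isCompanyMention("Apple, Inc."));  // true
            System.out.println(m.isCompanyMention("Apple"));        // true
            System.out.println(m.isCompanyMention("Alphabet"));     // false
        }
    }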
Week 3
After the script finished running, I examined the new data produced by the two changes I made and found some interesting results. Moving the named entity recognition to the earlier stage did capture the new instances I was aiming for, but it also had side effects that would negatively affect the later stages of the data preparation. Upon closer inspection and further analysis, I determined that these side effects outweighed the few new instances that were captured, and therefore that the named entity recognition stage should not be moved after all. For the second change, where I added my own code, I compared the counts of company mentions before and after my modifications. The number of company mentions captured rose significantly, and after closely examining the differences in the data, I found that the majority of the additional captures were correct. There were a few issues following this change, but none that can't be fixed, and overall it yielded great results.
Building on the idea of capturing more instances of company names, the next step was to incorporate coreference resolution - resolving indirect references to an entity. Right now, the algorithm only captures instances where company names are explicitly mentioned, but with coreference resolution, more instances would be captured, such as pronouns that refer to the companies. I used the coreference parser provided by Stanford CoreNLP to obtain the data I needed, so that I could incorporate it into the code and capture those instances in the text. I spent the remainder of the week working on this code. I ran into some problems, but will continue to debug and hopefully complete it next week!
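For reference, here is a minimal sketch of how coreference chains can be obtained from Stanford CoreNLP in Java using the deterministic dcoref annotator (depending on the CoreNLP version, these classes may live under edu.stanford.nlp.coref instead of edu.stanford.nlp.dcoref); how the chains are then folded into the project's own mention-capturing code is not shown here:

    import java.util.Map;
    import java.util.Properties;

    import edu.stanford.nlp.dcoref.CorefChain;
    import edu.stanford.nlp.dcoref.CorefCoreAnnotations.CorefChainAnnotation;
    import edu.stanford.nlp.pipeline.Annotation;
    import edu.stanford.nlp.pipeline.StanfordCoreNLP;

    public class CorefDemo {
        public static void main(String[] args) {
            // Annotators required before deterministic coreference (dcoref) can run.
            Properties props = new Properties();
            props.setProperty("annotators",
                    "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
            StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

            String text = "Apple reported strong earnings. "
                        + "The company said its revenue grew.";
            Annotation document = new Annotation(text);
            pipeline.annotate(document);

            // Each chain groups mentions that refer to the same entity,
            // e.g. "Apple", "The company", "its".
            Map<Integer, CorefChain> chains = document.get(CorefChainAnnotation.class);
            for (CorefChain chain : chains.values()) {
                System.out.println(chain);
            }
        }
    }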
Week 4
There were a few additional modifications I had to make to the code before it was ready to run on the data set. It wasn't behaving exactly as it should on the sentences I was testing it with, but after making the necessary changes and running it a couple more times, I determined that it was ready to run on the full data set. Since there is a huge amount of data, it can take a few days for the program to finish. In the meantime, I wrote documentation describing what happens at each stage of the data preparation. The script calls a Java program at each of these steps, so I went through the source code and described what happens to the data at each one. I also read some articles about the Stanford CoreNLP coreference parser to gain insight into its internal workings.
I began looking at some of the data produced so far (the program is still running) and can already detect some issues with the coreference parser. It may be possible to fix these problems by removing elements from the source code of the Stanford parser that are irrelevant for the current task. I read the articles describing the coreference parser to learn about the process it goes through and to determine which aspects are relevant and which may be causing errors in the results. I will look at the code in more detail next week and modify it accordingly.
Week 5
I looked at the source code and documentation of the Stanford CoreNLP parser, and by learning more about how it works, I was able to determine which portion of the code needed modification in order to yield the best results for the data I'm working with. I experimented with the code to figure out which aspects were causing the errors by selecting elements to remove and observing the results. After much experimentation and testing, I found three elements that were causing many of the errors and removed them from the source code. I am currently running this modified version on the data and will examine the results next.
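I made my changes directly in the parser's source, but to illustrate the general idea of ignoring the error-prone parts, a post-filter over the parser's output is one alternative: for example, keeping only chains whose representative mention is a proper noun (a hypothetical criterion here, not necessarily the one behind the elements I removed):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    import edu.stanford.nlp.dcoref.CorefChain;
    import edu.stanford.nlp.dcoref.Dictionaries.MentionType;

    // Hypothetical post-filter: keep only coreference chains anchored on a
    // proper-noun mention, assuming chains anchored on pronouns or bare
    // nominals are the ones introducing errors for this task.
    public class ChainFilter {
        public static List<CorefChain> properNounChains(Map<Integer, CorefChain> chains) {
            List<CorefChain> kept = new ArrayList<>();
            for (CorefChain chain : chains.values()) {
                if (chain.getRepresentativeMention().mentionType == MentionType.PROPER) {
                    kept.add(chain);
                }
            }
            return kept;
        }
    }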
Week 6
I expected the new results to be an improvement over the previous version; however, they did not seem to be. When I couldn't find the source of the problem, I ran the data again to see if it would yield the same output. When it did not, I realized something must have gone wrong with the first run (most likely a glitch in the system), so I ran it a few more times to make sure I obtained the correct results.
I also began to learn about the later stages of the process (after the named entity recognition and the coreference resolution), so that once I have all the new data, I can compare it to the old data and get a quantitative measure of the overall improvement of the new version. There were some problems when running these stages on the old data from my computer, but after some troubleshooting and modifying the script, I was able to get the baseline results.
Week 7
After solving all the issues with running the modified version and the coreference parser, it was time to run the remaining stages of the program to obtain the prediction scores. Before doing that, I manually examined 10 documents to see how much the number of company mentions increased from the original program to the new one (since counting mentions over the whole portion of the data I was dealing with would take a long time). The number of mentions increased by about 14%. I also counted the additional number of sentences containing company mentions (rather than the number of mentions themselves); this count rose by over 6,000 from the old program to the new one (so the increase in mentions would be even higher). These results looked promising. However, after the remaining steps were complete, I looked at the prediction scores and was surprised to see that they were around the same. It seemed that a majority of the newly captured instances carried semantic information similar to instances already captured, and as a result, the new captures weren't contributing to the overall score. This was informative nonetheless, since it shows that the goal is not simply to capture as many mentions as possible.
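For a sense of the bookkeeping involved (the counts above came from manual inspection; the names and data layout here are hypothetical), a comparison of per-company mention counts between the two runs might look like this:

    import java.util.Map;
    import java.util.TreeMap;

    // Hypothetical sketch: given mention counts per company from the old and
    // new runs, report the absolute and relative change for each company.
    public class MentionCountComparison {
        public static void report(Map<String, Integer> oldCounts,
                                  Map<String, Integer> newCounts) {
            for (Map.Entry<String, Integer> e : new TreeMap<>(newCounts).entrySet()) {
                int before = oldCounts.getOrDefault(e.getKey(), 0);
                int after = e.getValue();
                double change = before == 0 ? 100.0
                        : 100.0 * (after - before) / before;
                System.out.printf("%-20s %5d -> %5d  (%+.1f%%)%n",
                        e.getKey(), before, after, change);
            }
        }
    }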
Week 8
I then looked closely at the data to determine whether the additional information was indeed irrelevant or redundant, or whether another factor prevented the increase in prediction score. I looked at two companies in detail - for both, the number of sentences increased significantly from the old program to the new one, but for one the overall prediction score decreased while for the other it increased. I went through all the articles that contained additional captured sentences and noted which instances were captured by both programs versus only by the modified program. This process took a long time, since there were many articles. I also noted whether each additional mention was a named entity or a coreference capture; this distinction did not appear to have an effect on the results.
Next week, I will look more closely at the results of the semantic parser to see if I can figure out why the prediction score increased for one company but decreased for the other, even though the number of mentions increased for both. It will be difficult to determine how to fix this problem, since the goal of the named entity recognition is merely to capture company names without taking semantic information into account.
Week 9
I began looking at the results from the semantic parser, but there were so many new additions from the old program to the new one that it was difficult to determine what was beneficial and what wasn't. It was decided that this was not a useful approach for figuring out why the prediction performance increased for some companies and decreased for others when the number of mentions increased for both. Instead, I began to document the rest of the stages in the pipeline, from after the named entity recognition until the very end.
In addition, a few weeks ago I submitted a paper based on my summer work to a text mining conference in Rome. I found out on Friday of this week that the paper was accepted! I am really excited to present the work I have been doing as part of this DREU experience.
Week 10
In my final week, I split my time between completing the documentation of the pipeline and working on the presentation for the conference I will be attending. I will be giving a 30-minute oral presentation, so I am currently putting together a PowerPoint. I look forward to going to Rome to present my work from this summer!