08/03/2012: Analysis and Progress

While I was waiting for the Bulgarian and Tamil translation mark-ups, I tried marking up the translation errors in Spanish myself. I took about four years of Spanish across high school and college, so I thought I might be able to decipher the Google Translate output. It was a good experience to learn exactly what process I was putting the native speakers through, but I realized that I don't know the language well enough to tell whether a native Spanish speaker would understand the non-literal meaning. I could tell whether the phrase had been translated, but even when it has been, people sometimes don't understand what it means.

From a quick analysis of the marked-up translation errors for Bulgarian and Tamil, we found that the sentences were more often mistranslated than translated correctly. For Bulgarian, 12 out of 17 sentences (about 71%) were mistranslated or garbled. For Tamil, 14 out of 17 sentences (about 82%) were mistranslated. Additionally, Ani annotated the sentences with other interesting tags. She was trying to figure out whether non-literal phrases are one of the major sources of translation errors. To look at that, she marked the other places in each sentence where there was a notable translation error. Most sentences had 0 to 2 other errors, and because almost all the sentences contained only one non-literal phrase, there were quite a few cases where the other errors outnumbered the non-literal ones. Of course, sometimes the non-literal phrase was translated correctly. Ideally we would like to get a couple more languages to compare these results with. I would like to see how the translation errors differ from language to language. We only have about 2 weeks of work left, and it is taking a lot of time just to get these two mark-ups, so we will see how lucky we are in getting another one.
On a different note, the Mechanical Turk survey is going well, or at least it is getting done. Since the first batch finished in a day, Ani asked Ethan to put another one out there, and although it took a little longer, we now have 40,000 responses to work with. At least, that was the maximum. We had to make a quick pass over the responses we received to weed out the ones that are not very useful. For starters, we had to decide on the minimum time it would plausibly take to do a HIT, so that we could weed out anyone who just submitted random answers. Since I wasn't the one doing it, I forget the exact number, but Ethan got rid of all the responses that took less than 35 seconds. Personally I feel that is too low, but I guess sometimes a HIT could be done that quickly. I feel like it takes me half a minute just to read the lead.

When we took a quick glance at the results, they did not look promising. The problem is that everyone has their own idea of what makes a good lead, and everyone interprets the questions, and even the scale, differently. People have different opinions on what is interesting. It could be that half the people mark everything as interesting because that is just how they are, while other people treat "interesting" as an honor and reserve the top ratings for a select few. Then there are the people who just pick the center for everything. When people don't know what to say, they choose the indifferent middle slot, and that's not very helpful. We also spent some time discussing the scale. We debated whether 3 was the neutral center between a negative side and a positive side, or whether 3 was just one level on a scale that was entirely positive. After realizing this ambiguity, we decided that we may need to change how the questions are asked, because with the questions as written, the scale could mean different things for different questions.
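The time-based weeding described above can be sketched in a few lines of Python. This is only an illustration, not the script Ethan actually ran: the `Response` record and its field names (`worker_id`, `lead_id`, `seconds`, `rating`) are hypothetical, and the only value taken from the post is the 35-second cutoff.

```python
from collections import Counter
from dataclasses import dataclass

# Cutoff mentioned in the post; everything else here is an assumed layout.
MIN_SECONDS = 35.0


@dataclass
class Response:
    worker_id: str   # hypothetical field: who answered
    lead_id: str     # hypothetical field: which lead was rated
    seconds: float   # hypothetical field: time spent on the HIT
    rating: int      # hypothetical field: answer on the 1 to 5 scale


def drop_fast_responses(responses, min_seconds=MIN_SECONDS):
    """Keep only responses that took at least `min_seconds` to submit."""
    return [r for r in responses if r.seconds >= min_seconds]


def responses_per_lead(responses):
    """Count how many responses each lead still has after filtering."""
    return Counter(r.lead_id for r in responses)


sample = [
    Response("w1", "lead_a", 12.0, 3),
    Response("w2", "lead_a", 48.5, 4),
    Response("w3", "lead_b", 35.0, 2),
    Response("w4", "lead_b", 20.1, 5),
]

kept = drop_fast_responses(sample)
print(responses_per_lead(kept))
```

Counting the survivors per lead right after filtering makes it easy to see which leads are left with too few responses to say anything about.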
We decided to analyze the responses further, but to change the layout and the questions before sending out the next batch. We already tried looking at how well the responses for each lead agreed on each question, but because of the scale there wasn't much agreement. The scale ran from 1 to 5, and it did help a little when we combined 1 & 2 and 4 & 5. We haven't had time to look into the individual leads yet, but some of them now have fewer than 10 responses because we weeded out the people who didn't make the time cut.
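To make the effect of merging 1 & 2 and 4 & 5 concrete, here is a small Python sketch. The post does not say which agreement measure we used, so the one below (the share of responses that fall in the most common answer) is an assumption chosen for simplicity, and the ratings in the example are made up.

```python
from collections import Counter


def collapse(rating):
    """Map a 1 to 5 rating onto three bins: 1 & 2, 3, and 4 & 5."""
    if rating not in (1, 2, 3, 4, 5):
        raise ValueError(f"rating out of range: {rating!r}")
    if rating <= 2:
        return "low"
    if rating >= 4:
        return "high"
    return "mid"


def modal_agreement(answers):
    """Fraction of answers that match the most common answer (assumed measure)."""
    if not answers:
        raise ValueError("need at least one answer")
    most_common_count = Counter(answers).most_common(1)[0][1]
    return most_common_count / len(answers)


# Made-up ratings for one lead on one question.
ratings = [1, 2, 2, 4, 5, 5, 3, 1]

raw = modal_agreement(ratings)
merged = modal_agreement([collapse(r) for r in ratings])
print(f"raw agreement: {raw:.2f}, collapsed agreement: {merged:.2f}")
```

Collapsing can only keep agreement the same or raise it under this measure, since answers that used to be split between neighboring values now land in one bin. It does not settle whether 3 is a neutral midpoint or just a mildly positive level, so the middle bin stays as ambiguous as before.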