Week 9

The script to scrape Telugu text from blog pages is complete! I ran it on several pages and the data was collected properly. At first, I tried collecting every link on each blog page to build a list of pages to visit, but the URLs were inconsistent, as was the HTML. I then found a list of archive pages in the sidebar of every blog page. When I looked at the archive pages, I was happy to see that their URLs followed a consecutive pattern and that the Telugu text sat in the same HTML tags on every page. Using a for loop, I iterated through the blog's URLs by year and month, parsed each page, and wrote the desired text to a file. After running the script on the blog it was intended for, I found other blogs with the same format and was able to scrape them successfully as well. So far, I've scraped three blogs.

This week I also fixed a TTS bug. Erica noticed a mistake in the utterances of one speaker: the number of text files exceeded the number of audio files. I updated the script that creates the utterance file, and now the data is accurate.
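
For anyone curious about the archive-page loop described above, here is a minimal sketch of the idea using only the Python standard library. It is not the actual script: the URL pattern (`<base>/<year>/<month>/`), the `div` tag with class `post-body`, and all function names are assumptions made for illustration, and a real blog may lay things out differently. The Telugu check relies on the Unicode Telugu block (U+0C00 to U+0C7F).

```python
from html.parser import HTMLParser
from urllib.request import urlopen


def archive_urls(base_url, first_year, last_year):
    """Yield one archive URL per month.

    Assumed pattern: <base>/<year>/<month>/ (hypothetical).
    """
    base = base_url.rstrip("/")
    for year in range(first_year, last_year + 1):
        for month in range(1, 13):
            yield f"{base}/{year}/{month:02d}/"


class PostTextParser(HTMLParser):
    """Collect text found inside <div class="post-body"> (assumed tag and class)."""

    def __init__(self, tag="div", css_class="post-body"):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.depth = 0      # > 0 while we are inside a matching element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != self.tag:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth > 0:
            self.depth += 1          # nested element of the same tag
        elif self.css_class in classes:
            self.depth = 1           # entering the post body

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.chunks.append(data.strip())


def contains_telugu(text):
    """True if any character falls in the Unicode Telugu block."""
    return any("\u0c00" <= ch <= "\u0c7f" for ch in text)


def extract_telugu(html):
    """Return the post-body text chunks that contain Telugu characters."""
    parser = PostTextParser()
    parser.feed(html)
    return [chunk for chunk in parser.chunks if contains_telugu(chunk)]


def scrape(base_url, first_year, last_year, out_path):
    """Visit every monthly archive page and write the Telugu text to a file."""
    with open(out_path, "w", encoding="utf-8") as out:
        for url in archive_urls(base_url, first_year, last_year):
            try:
                with urlopen(url) as resp:
                    html = resp.read().decode("utf-8", errors="replace")
            except OSError:
                continue  # skip months with no archive page or a failed request
            for line in extract_telugu(html):
                out.write(line + "\n")
```

Calling something like `scrape("https://example.blogspot.com", 2015, 2016, "telugu.txt")` (a placeholder address) would walk 24 monthly pages. Because the URLs are generated rather than discovered by following links, the same loop can be pointed at any other blog that shares the layout by changing only the base URL.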
