The next phase of my project is synthesizing a voice using the HTS Speaker Demo. This process has two stages. First, we need to run the demo with its default parameters, and second, we need to prepare the data that we will use to build our custom voice.

I began the first task by downloading the demo. I then configured it and ran it. The demo is still running - the synthesis process takes a few days.

For the second task, I created a .data file that lists the sentence files and their contents in the format specified in the demo instructions. I then wrote a script to convert the .wav sentence files to .raw files, a prerequisite for the synthesis. I configured the demo using the default variables and encountered a problem with the .data file. Changing the naming scheme for the sentence files (I added an underscore) solved the issue. I followed the remaining instructions for the demo and executed the necessary commands, which created the .utt files. The new data is now ready to be synthesized.

On Monday, I met with Beppe Riccardi, a professor from Trento who visited our lab. We talked about my project, and he suggested looking into Active Learning as a way to implement data selection.

Carlos Busso gave a talk on Wednesday about detecting emotions in speech and motion. Erica and I discussed our project with him,
and he told us about archive.org, a website which offers video, text, software, and audio for free download. We were thinking of using this resource to obtain more audio files to train a voice with.
The results will be stronger if we use a lot of data. In the end we decided to stick with corpora, because the corresponding text for the audiobooks is not always available.
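
For anyone repeating the data preparation step, here is a rough sketch of what the .wav to .raw conversion and the .data entries could look like. This is not the script I actually used. The function names and the directory layout are made up for illustration, and the `( name "text" )` line layout is an assumption based on the Festival-style prompt files, so check it against the demo instructions. The sketch only strips the WAV header and does not check or change the sample rate, so the recordings must already match what the demo expects.

```python
import wave
from pathlib import Path


def wav_to_raw(wav_path, raw_path):
    """Copy the PCM samples of a .wav file into a headerless .raw file."""
    with wave.open(str(wav_path), "rb") as wav_file:
        frames = wav_file.readframes(wav_file.getnframes())
    Path(raw_path).write_bytes(frames)
    return len(frames)


def convert_directory(wav_dir, raw_dir):
    """Convert every .wav file in wav_dir and return the new .raw paths."""
    raw_dir = Path(raw_dir)
    raw_dir.mkdir(parents=True, exist_ok=True)
    converted = []
    for wav_path in sorted(Path(wav_dir).glob("*.wav")):
        raw_path = raw_dir / (wav_path.stem + ".raw")
        wav_to_raw(wav_path, raw_path)
        converted.append(raw_path)
    return converted


def data_line(utt_name, text):
    """Build one .data entry (assumed layout: ( name "text" ))."""
    # Escape backslashes and double quotes so the quoted text stays intact.
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'( {utt_name} "{escaped}" )'
```

Calling `convert_directory("wav", "raw")` would write one .raw file per sentence file, and `data_line` can be called once per sentence to build the lines of the .data file.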