All relevant files and scripts should be included here; if something isn't here, it's probably MOSES-related and can be obtained as described below! For posterity, the Xhosa-English bible material (extraction and alignment scripts, etc.) is included as well in the folder xh-en-bible.

MOSES and the MOSES scripts (which need to be compiled on your system) can be downloaded as an SVN repository on sourceforge: http://mosesdecoder.sourceforge.net/svn.php

The corpus was obtained from capegateway.gov.za using wget w/ the mirror option (-m). Duplicate files were removed with fnodupe. Pscrape was used to align html files from both the xho and eng subdirectories (scrape is just included for posterity). All scripts written by me include instructions in their documentation and should be fairly straightforward to use. :)

Processing steps, with the files each one produced:

0. Originals are xho.txt and eng.txt
1. File headers removed (output in scrape.en and scrape.xh)
2. Tokenized by the MOSES tokenizer.perl script (output in tokens.en and tokens.xh)
3. Lowercased (output in lower.xh and lower.en)
4. Lines w/ no text (containing only numbers, symbols, etc.) stripped (output in text.xh and text.en)
5. Pre-processed w/ clean-corpus-n.perl (output in clean.en and clean.xh)
6. Chunked w/ partialAlign.py and batch-processed through hunalign. The chunked data is located in chunks; chunking was necessary to avoid segmentation faults. Bisegments w/ quality < 1 were filtered out (reduced set by ~2 million...); this set is located in corpus/filtered-1. Alternatively, another set w/ bisegments of quality < 0 filtered out is located in corpus/filtered-0. Used null.dic and the -realign option to overcome the lack of a good plaintext bilingual dictionary, so hunalign mostly relied on Gale-Church sentence information. Read the hunalign documentation for more information.
7. Used the paste command to get aligned files w/out the quality values (output in align.txt)
8. Sorted (sort -u) to keep only unique lines and eliminate repetition (output in align.unique, unique.xh, unique.en)

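For anyone redoing steps 4 and 6 without the original scripts, here is a rough Python sketch of the two filters. It is not the code that was actually used. Assumptions: "no text" is taken to mean "contains no letter at all", and the hunalign output is assumed to be tab-separated as source, target, quality (one bisegment per line). The function names are made up for illustration.

```python
def has_text(line):
    """Return True if the line contains at least one letter.

    Assumption: a line made up only of digits, punctuation, symbols,
    or whitespace counts as having "no text".
    """
    return any(ch.isalpha() for ch in line)


def strip_textless(lines):
    """Step 4 sketch: drop lines that contain no letters."""
    return [ln for ln in lines if has_text(ln)]


def filter_bisegments(lines, min_quality):
    """Step 6 sketch: keep bisegments whose quality is >= min_quality.

    Assumption: each input line looks like "source<TAB>target<TAB>quality".
    Lines that do not fit that shape are skipped rather than guessed at.
    Returns a list of (source, target) pairs without the quality value.
    """
    kept = []
    for ln in lines:
        parts = ln.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue
        src, tgt, score = parts
        try:
            quality = float(score)
        except ValueError:
            continue
        if quality >= min_quality:
            kept.append((src, tgt))
    return kept
```

Calling filter_bisegments(lines, 1) corresponds to the "quality < 1 filtered out" set, and filter_bisegments(lines, 0) to the weaker one. Since it already drops the quality column, the result is roughly what step 7 produces with paste.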
MOSES decoders were trained on the "strongly" filtered data and the "weaker" filtered data; they are located in xh-en-filtered and xh-en-unfactored, respectively. The translated csv files for each can be found in the same directories. Translation was automated by the script translate-csv. The configuration file for each decoder is at the relative path /model/moses.ini inside its directory; paths to language models, etc. must be changed for your environment (MOSES requires FULL paths here).

To translate, simply run the translate-csv bash script w/ these arguments, in order: the csv file to translate, the field/col number of the text to be translated (use 1-indexing, not 0-indexing!!!), and the desired configuration file. A new, translated csv will be written, as well as a file where each original text is "aligned" w/ its translation (delimited by a TAB).
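To make the column handling concrete, here is a rough Python sketch of what a script like translate-csv does. It is not the bash script itself. Assumptions: the csv has no header row, each cell to translate is a single line, the translated text replaces the original in the output csv, and a moses binary is on your PATH that reads one sentence per line on stdin. The names translate_csv and moses_translate are made up for illustration.

```python
import csv
import subprocess


def moses_translate(sentences, ini_path):
    """Pipe sentences through moses, one per line.

    Assumption: a `moses` executable is on PATH and `ini_path` is a
    moses.ini whose internal paths are already full paths.
    """
    result = subprocess.run(
        ["moses", "-f", ini_path],
        input="\n".join(sentences) + "\n",
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()


def translate_csv(in_path, out_path, aligned_path, column, translate):
    """Translate one column of a csv file.

    `column` is 1-indexed, as in the README. `translate` is any function
    that maps a list of strings to a list of translations of equal length.
    """
    if column < 1:
        raise ValueError("column is 1-indexed, so it must be >= 1")
    idx = column - 1  # convert to Python's 0-indexing

    with open(in_path, newline="", encoding="utf-8") as f:
        rows = list(csv.reader(f))

    originals = [row[idx] for row in rows]
    translations = translate(originals)
    if len(translations) != len(originals):
        raise ValueError("translator returned a different number of lines")

    # New csv: same rows, with the chosen column replaced by its translation.
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for row, new in zip(rows, translations):
            writer.writerow(row[:idx] + [new] + row[idx + 1:])

    # "Aligned" file: original text, a TAB, then the translation.
    with open(aligned_path, "w", encoding="utf-8") as f:
        for old, new in zip(originals, translations):
            f.write(f"{old}\t{new}\n")
```

With a real decoder the last argument would be something like lambda s: moses_translate(s, "/full/path/to/moses.ini"), where the path is a placeholder for your own setup. Any preprocessing the model expects (tokenizing, lowercasing) is left out of this sketch.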