NOTES FOR THE MACHINE TRANSLATION (MT) NEWBIE
---------------------------------------------

Here you will find a collection of notes compiled from my experience as a summer research intern at Carnegie Mellon's Language Technologies Institute, where I implemented a Xhosa-English phrase-based translation system for Prof Carolyn Rose's codeswitching research. When I started, I had next to no experience working in a Linux environment, so this document is written with Linux/Ubuntu newbies in mind as well!

Please keep in mind that your mileage may vary depending on your hardware specifications, OS, etc. In my case, my HP laptop has a 2.5 GHz Intel Core 2 Duo T9300 processor (two cores, as the name suggests), 4 GB of RAM, and approximately 250 GB of hard disk space.

First things first: install some Linux distribution onto your workstation. On my laptop, I installed Ubuntu 11.04 (Natty Narwhal), which has its ups and downs: some of the software below has compatibility issues with it, largely because older versions of gcc (the compiler from the GNU Compiler Collection) have been deprecated. From what I've read, this is not an issue in older versions of Ubuntu. Please keep this in mind as you proceed, as you may or may not need to tinker around to get everything installed properly in your own environment, in which case Google (or your search engine of choice) is most certainly an invaluable ally!

What follows generally tracks the instructions provided on the following webpage for building a baseline SMT system:

    http://statmt.org/wmt11/baseline.html

Anything preceded by a $ is something you need to enter into your shell (I use '$' because I'm working with bash), just in case you need a little extra help!

INSTALLING SRILM
----------------

0. Download here: http://www.speech.sri.com/projects/srilm/

   Note: You will have to agree to some terms of use.

1. Consider adding an environment variable (env) that points to the SRILM root directory (in my case, SRILM=/usr/local/srilm). I personally had problems loading environment variables when simply adding them to /etc/environment, as recommended, so you may wish to edit your .profile file instead (it should be located in your home directory, i.e., under /home/; you can create one if you don't already have it or a similar file such as .bash_profile) to include a line such as:

       export SRILM=/path/to/srilm

   Unpack the downloaded tarball to /usr/local/srilm (or whatever directory you wish; I'm not afraid to admit that the first time around, I installed everything into the root bin (/bin) directory, which is reserved for essential system binaries. Don't make the same mistake I did!!!).

       $ mkdir $SRILM
       $ cd $SRILM
       $ tar -zxvf path/to/srilm.tgz

   Note: You may or may not need to handle this as a superuser/root (use sudo).

2. Now we need to handle dependencies. These include: build-essential, csh, tcl-dev, gawk. Install these like so (this example is what you'd type in bash on Ubuntu):

       $ sudo apt-get install build-essential csh tcl-dev gawk

3. Now run:

       $ $SRILM/sbin/machine-type

   to determine your machine type so you can edit the appropriate makefile in $SRILM/common. For example, I needed to edit the Makefile.machine.i686 file to set NO_TCL to X due to compilation errors; however, the SRILM documentation indicates that this isn't a problem, since TCL is only needed during installation for testing and will not impact the performance and behavior of SRILM post-installation. For the thorough, you may wish to note what version of TCL you have on your system, as well as the paths to its header files and library, so you can edit the fields TCL_INCLUDE and TCL_LIBRARY in your makefile.

   You will also need to uncomment the SRILM variable near the top of the top-level Makefile (the one in $SRILM) and set it to the proper path to SRILM.

4. This is important!
   In whatever file you use to store your envs, make sure LC_NUMERIC=C is set so that numbers are formatted and parsed the way SRILM expects. Other instructions might suggest setting LC_ALL, but this may mess up your localization settings.

5. Go back to the top SRILM directory (e.g., $ cd $SRILM). Now type:

       $ sudo make World

   That was easy!

6. Run tests:

       $ sudo make test

   Your tests should result in "IDENTICAL" output. If you get messages that your output "DIFFERS", you might want to try installing again, because something didn't turn out quite right...

7. Finally:

       $ sudo make cleanest

8. Now add $SRILM/bin and the machine-specific subdirectory under $SRILM/bin/ (it is named after the machine type from step 3; i686 in my case) to your path, and $SRILM/man to your manual path (MANPATH). To do so, again, just edit .profile and be sure to delimit new additions to your path variables with colons (:); e.g., for machine type i686:

       PATH=$SRILM/bin:$SRILM/bin/i686:REST/OF/PATH
       MANPATH=$SRILM/man:REST/OF/MANPATH

INSTALLING GIZA++
-----------------

0. Download the latest version of GIZA++ here: http://code.google.com/p/giza-pp/downloads/list

1. Unpack to your desired location:

       $ cd /usr/local
       $ tar -zxvf path/to/giza-pp-v1.0.5.tar.gz

   This creates the folder giza-pp, with subdirectories GIZA++-v2 and mkcls-v2, among other things.

2. Here's where we run into issues: GIZA++ was written with an older version of GCC in mind, which has since been deprecated. If you simply try to install it as it is now, you will probably run into errors when compiling, or encounter buffer overflow errors when actually attempting to align with GIZA++. The problem is documented here:

       http://code.google.com/p/giza-pp/issues/detail?id=11

   You need to edit the file_spec.h file located in giza-pp/GIZA++-v2 as follows (obtained from the above discussion; only the two lines marked with ! change):

       Comment 4 by gil...@cs.rochester.edu, Jul 13, 2009
       The year doesn't fit in two digits - suggested fix:

       *** file_spec.h 2009/07/10 21:38:39 1.1
       --- file_spec.h 2009/07/13 11:37:21
       ***************
       *** 37,49 ****
           struct tm *local;
           time_t t;
           char *user;
       !   char time_stmp[17];
           char *file_spec = 0;
           t = time(NULL);
           local = localtime(&t);
       !   sprintf(time_stmp, "%02d-%02d-%02d.%02d%02d%02d.", local->tm_year,
                   (local->tm_mon + 1), local->tm_mday, local->tm_hour,
                   local->tm_min, local->tm_sec);
           user = getenv("USER");
       --- 37,49 ----
           struct tm *local;
           time_t t;
           char *user;
       !   char time_stmp[19];
           char *file_spec = 0;
           t = time(NULL);
           local = localtime(&t);
       !   sprintf(time_stmp, "%04d-%02d-%02d.%02d%02d%02d.", 1900 + local->tm_year,
                   (local->tm_mon + 1), local->tm_mday, local->tm_hour,
                   local->tm_min, local->tm_sec);
           user = getenv("USER");

   The old code printed tm_year (years since 1900, so three digits from 2000 onward) into a field sized for two digits, which overruns the 17-byte buffer. The fix prints a four-digit year into a 19-byte buffer: 18 characters plus the terminating NUL.

3. Make sure you're back in the giza-pp (top) directory and install:

       $ sudo make

   If you have difficulties getting things to compile, you may want to try building an older version of gcc from scratch. This page might be helpful:

       http://misspent.wordpress.com/2011/04/26/compiling-g-4-1-on-ubuntu-natty-narwhal/

   In fact, I had to do this to get GIZA++ to run properly on my system, but you might not have as painful an experience!

4. Copy GIZA++, snt2cooc.out, and mkcls to a bin directory (something that will be included in your path); i.e.,

       $ cp GIZA++-v2/GIZA++ /usr/local/bin
       $ cp GIZA++-v2/snt2cooc.out /usr/local/bin
       $ cp mkcls-v2/mkcls /usr/local/bin

INSTALLING MOSES
----------------

0. Download it! The repository for MOSES can be found at:

       https://github.com/moses-smt/mosesdecoder

   Once you've placed the mosesdecoder directory in your desired location, you should create another env (let's call it MOSES) that contains the full path to mosesdecoder.

1. Now to tackle dependencies: you'll need aclocal, automake, autoconf, libtool, m4, and Boost. (Nobody tells you these things, but you'll need them to properly compile MOSES!)
       $ sudo apt-get install automake autoconf libtool m4

   (aclocal is not a package of its own; it is installed as part of automake.)

   I installed the Boost C++ libraries manually from source; you can obtain the source code here:

       http://www.boost.org/users/download/

   To install, simply download the source files to your directory of choice, make sure you're in that directory, then run the script bootstrap.sh:

       $ sudo ./bootstrap.sh

   Now you can install:

       $ sudo ./b2 install

   NOTE: This might take a while!

2. We're almost done! Just cd to $MOSES (go to the mosesdecoder directory), then run the regenerate-makefiles.sh script:

       $ sudo ./regenerate-makefiles.sh

   Next, we configure our MOSES build to use SRILM:

       $ sudo ./configure --with-srilm=$SRILM

   I hope you set that env; otherwise you'll have to type the full path to the top-level directory of SRILM. Finally, we can install it:

       $ sudo make

   THAT was easy. :) Don't forget to add $MOSES/moses/moses-cmd/src to your path:

       PATH=$MOSES/moses/moses-cmd/src:REST/OF/PATH

   If you're still having trouble, you may want to consult the "Get Started" guide here:

       http://www.statmt.org/moses/?n=Development.GetStarted

   If that proves unhelpful, there is also a more detailed step-by-step guide for installing MOSES here:

       http://www.statmt.org/moses_steps.html

3. You've successfully installed MOSES! Congratulations! Now for a few helpful hints on using it:

   -To run the decoder, use the command:

       $ moses -f /PATH/TO/moses.ini < $IN > $OUT

    (Hopefully you've included the path to it as suggested above!) The -f option indicates the configuration file you want to use (use the proper path to that particular moses.ini file), < $IN is the source file to be translated, and of course > $OUT is where your translated output will be saved.

   -ALWAYS check the proper moses.ini file (this is your CONFIGURATION FILE for a particular project and will be located within its corresponding directory). Make sure that the paths to your language model(s), phrase tables, etc. and the settings are correct!
   -Training a baseline system is a whole other beast, but MOSES provides useful scripts for this task. One that will most likely prove useful is clean-corpus-n.perl, which will clean up a parallel corpus. Note that a corpus should be divided into two text files named $CORPUS.$L1 and $CORPUS.$L2; e.g., text.en and text.fr. These files should be aligned line by line: each sentence occupies one line, and line N of one file is the translation of line N of the other. The clean-corpus-n.perl script can handle capitalization (although you will need to change a flag in the script), and will also clean up most extraneous whitespace, etc. For difficulties in producing aligned text, you might want to check out hunalign (http://mokk.bme.hu/resources/hunalign/).

    Anyway, to train an unfactored model (meaning we take into account only surface forms of words to create a barebones phrase-based translation system):

       $ train-model.perl --corpus PATH/TO/$CORPUS --root-dir $MODEL --e $L1 --f $L2 --lm 0:3:/FULL/PATH/TO/OUTPUT/LANGUAGE/MODEL:0

    Note that these are the minimal parameters needed to train a model (there are a lot, lot more you can use!). In MOSES, --f names the source ("foreign") side and --e the output side, so the command as written translates from $L2 into $L1.

    The --lm option requires some explanation. Its value has four colon-separated fields, factor:order:path:type. The absolute path to your output-language LM (the language given with --e) must be provided. The first 0 is the factor the LM is built over (for the curious, the factored tutorial numbers its factors 0-3, where 0 = surface forms, 1 = lemmas (think of stems), 2 = part-of-speech (POS) tags, and 3 = morphological information). The 3 in the second position indicates that you want to consider up to trigrams (sequences of three consecutive words). The trailing 0 selects the LM implementation, where 0 means SRILM.

    More detailed information on training can be found here:

       http://www.statmt.org/moses/?n=Moses.FactoredTutorial

Good luck!
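
One more thing for the training step: before kicking off a long train-model.perl run, it can save time to check the two inputs that most often trip it up. What follows is only a minimal sketch, not part of MOSES. The function names (count_lines, check_parallel, lm_spec) and the demo file contents are made up for illustration. It checks that both halves of a parallel corpus have the same number of lines, and assembles the four-field --lm value while refusing a relative path. Equal line counts are necessary for a sentence-aligned corpus but do not prove the alignment is right.

```shell
#!/bin/sh
# Illustrative pre-flight checks for train-model.perl (hypothetical helpers,
# not MOSES scripts).

# Print the number of lines in a file, without the file name or padding.
count_lines() {
    wc -l < "$1" | tr -d ' '
}

# Succeed only if CORPUS.L1 and CORPUS.L2 have the same number of lines.
# Usage: check_parallel CORPUS L1 L2
check_parallel() {
    cp_n1=$(count_lines "$1.$2")
    cp_n2=$(count_lines "$1.$3")
    if [ "$cp_n1" -ne "$cp_n2" ]; then
        echo "MISMATCH: $1.$2 has $cp_n1 lines, $1.$3 has $cp_n2" >&2
        return 1
    fi
    echo "OK: $cp_n1 sentence pairs"
}

# Assemble the value passed to --lm as FACTOR:ORDER:PATH:TYPE.
# Usage: lm_spec FACTOR ORDER /ABSOLUTE/PATH/TO/LM TYPE
lm_spec() {
    case $3 in
        /*) ;;  # absolute path: fine
        *)  echo "LM path must be absolute: $3" >&2
            return 1 ;;
    esac
    echo "$1:$2:$3:$4"
}

# Demo on a tiny throwaway corpus with two sentence pairs.
demo_dir=$(mktemp -d)
printf 'hello\ngood morning\n' > "$demo_dir/text.en"
printf 'salut\nbonjour\n' > "$demo_dir/text.fr"
check_parallel "$demo_dir/text" en fr
lm_spec 0 3 /FULL/PATH/TO/OUTPUT/LANGUAGE/MODEL 0
rm -r "$demo_dir"
```

Run as is, the demo prints "OK: 2 sentence pairs" followed by "0:3:/FULL/PATH/TO/OUTPUT/LANGUAGE/MODEL:0". In real use you would point check_parallel at PATH/TO/$CORPUS with your own $L1 and $L2, and paste the lm_spec output after --lm.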