Six Degrees of Separation: Language Modeling for Sphinx4

To perform speech recognition using Sphinx4, you need a language model.
This post shows how to train a language model (LM) for Sphinx4 with the cmuslmt toolkit.

1) Prepare Corpus

You can use any free text. In this post, we will use Wikipedia. You can download and process the Wikipedia corpus by following the steps described in http://trulymadlywordly.blogspot.com/2011/03/creating-text-corpus-from-wikipedia.html
However, you need to make some changes.
Firstly, only the first three steps are needed: get the dump file, convert the dump file to sentences, and convert the sentence list to a corpus file. The remaining steps will be done with cmuslmt.
Secondly, you need to add "<s>" at the beginning of each sentence and "</s>" at the end of each sentence. Here is the script I wrote for this task: wikireader.py. It also splits the corpus into three parts: train, valid, and test. Only the "train" set is used to train the language model, and the "test" set is used to measure how good the model is. In addition, since the whole Wikipedia set is very large, only 500M of it is used.
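
To make the sentence marking and splitting concrete, here is a minimal sketch of the idea in Python. It is not the actual wikireader.py: the function names, the random shuffle, and the split fractions are all assumptions chosen for illustration, and it expects one sentence per input string.

```python
import random


def wrap_sentences(sentences):
    """Wrap each non-empty sentence in <s> ... </s> markers."""
    return ["<s> " + s.strip() + " </s>" for s in sentences if s.strip()]


def split_corpus(lines, valid_frac=0.05, test_frac=0.05, seed=0):
    """Shuffle the lines and split them into train / valid / test lists.

    The fractions are illustrative assumptions, not values from the post.
    """
    lines = list(lines)
    random.Random(seed).shuffle(lines)  # fixed seed keeps the split reproducible
    n_valid = int(len(lines) * valid_frac)
    n_test = int(len(lines) * test_frac)
    valid = lines[:n_valid]
    test = lines[n_valid:n_valid + n_test]
    train = lines[n_valid + n_test:]
    return train, valid, test
```

For example, split_corpus(wrap_sentences(sentences)) returns three lists that can each be written out one sentence per line, and only the first of them is passed on to cmuslmt.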

2) Download and install cmuslmt

a) download cmuslmt and unzip it
b) follow the instructions in the README to install it (pay attention to the "little-endian" or "big-endian" setting)
c) you can add the ../bin path to your PATH environment variable, so that you can use the tools from other directories. Search online if you don't know how to edit "PATH"
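
If you are unsure about the byte order of your machine or whether the PATH change worked, a small Python check like the following can help. This is only a sketch: the tool list is an assumption (the three commands named in this post plus idngram2lm, the toolkit's command that builds the model from the id n-gram file), and the helper name missing_tools is made up for this example.

```python
import shutil
import sys

# Assumed tool names: the commands mentioned in this post plus idngram2lm.
TOOLS = ["text2wfreq", "wfreq2vocab", "text2idngram", "idngram2lm"]


def missing_tools(names):
    """Return the names that cannot be found on the current PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    # sys.byteorder is "little" or "big"; compare it with the README's setting.
    print("byte order:", sys.byteorder)
    print("missing tools:", missing_tools(TOOLS))
```

An empty "missing tools" list means all four commands are reachable from the current directory.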

3) Train the model

The basic steps are:
a) Create a vocabulary using the text2wfreq and wfreq2vocab commands. Wikipedia contains too many distinct words to keep them all, so the vocabulary has to be limited;
b) Train an LM using text2idngram and idngram2lm. You may need to use cutoffs, because cmuslmt cannot support a "Size of trigram segment that is bigger than 65535";
c) Convert the model to DMP format (which is accepted by Sphinx4).
To simplify your work, I wrote a script that does all of this automatically, see create_lm.sh (you need to change the parameters, such as the path names).
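
As a rough illustration of steps a) and b), the sketch below builds the four shell command lines in Python. It is not the content of create_lm.sh. The file names, the vocabulary size of 20000, and the cutoff values are assumptions, and the flags shown (-top, -vocab, -idngram, -arpa, -cutoffs) should be checked against the documentation of your cmuslmt version. The DMP conversion of step c) is left out here.

```python
import shlex


def build_pipeline(corpus, prefix, top=20000, cutoffs=(1, 1)):
    """Return the shell command lines for steps a) and b), in order.

    All file names and numeric values are illustrative assumptions.
    """
    q = shlex.quote  # protect paths that contain spaces
    wfreq = prefix + ".wfreq"
    vocab = prefix + ".vocab"
    idngram = prefix + ".idngram"
    arpa = prefix + ".arpa"
    bigram_cutoff, trigram_cutoff = cutoffs
    return [
        # a) count word frequencies, then keep only the most frequent words
        f"text2wfreq < {q(corpus)} > {q(wfreq)}",
        f"wfreq2vocab -top {top} < {q(wfreq)} > {q(vocab)}",
        # b) map the text to id n-grams, then estimate the model with cutoffs
        f"text2idngram -vocab {q(vocab)} < {q(corpus)} > {q(idngram)}",
        f"idngram2lm -idngram {q(idngram)} -vocab {q(vocab)} "
        f"-arpa {q(arpa)} -cutoffs {bigram_cutoff} {trigram_cutoff}",
    ]
```

Each returned string can be run with subprocess.run(cmd, shell=True, check=True). Raising the cutoffs drops rare bigrams and trigrams, which is one way to stay under the trigram segment limit mentioned in step b).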