Six Degrees of Separation: Train a Recurrent Neural Network Language Model with Limited-sized Vocabulary

Mikolov et al. have shown that neural-network-based language models can improve speech recognition results [1]. They have also published an open-source toolkit for training an RNNLM (Recurrent Neural Network Language Model). This post explains how to use it to train a model with a limited vocabulary. A smaller vocabulary makes the model much faster to train and use. In addition, uncommon words contribute little information about the sentence.

The steps below walk through training the model.

First, prepare a corpus. See my previous post for how to extract a corpus from Wikipedia.

Second, download the RNNLM toolkit from RNNLM.

Third, replace uncommon words in the corpus with "<unk>". tounk.py is a Python script that does this. It takes two arguments: the first is the path to the text corpus; the second is the path to the vocabulary you want to use. The result is written to stdout. A typical invocation of tounk.py is:
$ python tounk.py $corpus $vocab > $corpus_unk
where $corpus is the filename of your corpus, $vocab is the filename of the vocabulary, and $corpus_unk is the output filename.
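
The source of tounk.py is not reproduced in this post, so the following is only a minimal sketch of what such a script might look like. It assumes the vocabulary file holds one word per line and that the corpus is whitespace-tokenized. The function names (build_vocab, load_vocab, to_unk) are hypothetical, and build_vocab is an extra helper showing one possible way to obtain a limited vocabulary by keeping the most frequent words.

```python
import sys
from collections import Counter


def build_vocab(lines, size):
    """Return the `size` most frequent whitespace-separated tokens (hypothetical helper)."""
    counts = Counter(word for line in lines for word in line.split())
    return {word for word, _ in counts.most_common(size)}


def load_vocab(path):
    """Read a vocabulary file, assumed here to hold one word per line."""
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}


def to_unk(line, vocab, unk="<unk>"):
    """Replace every token that is not in `vocab` with `unk`."""
    return " ".join(w if w in vocab else unk for w in line.split())


def main(corpus_path, vocab_path):
    """Print the corpus to stdout with out-of-vocabulary words replaced."""
    vocab = load_vocab(vocab_path)
    with open(corpus_path, encoding="utf-8") as corpus:
        for line in corpus:
            print(to_unk(line, vocab))


# Runs only when called as: python this_script.py <corpus> <vocab>
if __name__ == "__main__" and len(sys.argv) == 3:
    main(sys.argv[1], sys.argv[2])
```

Keeping the vocabulary in a set makes each membership test constant time on average, which matters when the corpus is as large as a Wikipedia dump.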

Finally, train the model. The script run_rnnlm.sh shows an example.
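
The contents of run_rnnlm.sh are not shown here, so the following is a rough sketch of how a training run could be launched from Python instead. It assumes the compiled toolkit binary is at ./rnnlm and that the corpus has been split into training and validation files, since, as far as I understand it, the toolkit expects a validation file when training. The flags (-train, -valid, -rnnlm, -hidden, -class, -bptt) are written as I recall them from the example bundled with the toolkit, so check them against your version. The file names and parameter values are placeholders, not recommendations.

```python
import subprocess


def rnnlm_train_command(train_path, valid_path, model_path,
                        hidden=100, classes=100, bptt=4, binary="./rnnlm"):
    """Build the argument list for one training run (flag names are assumptions)."""
    return [
        binary,
        "-train", train_path,    # training text, already mapped to <unk>
        "-valid", valid_path,    # held-out text
        "-rnnlm", model_path,    # where the trained model is written
        "-hidden", str(hidden),  # hidden layer size
        "-class", str(classes),  # number of word classes for the output layer
        "-bptt", str(bptt),      # steps of backpropagation through time
    ]


def train(*args, **kwargs):
    """Run the training command and raise if the binary exits with an error."""
    subprocess.run(rnnlm_train_command(*args, **kwargs), check=True)


if __name__ == "__main__":
    # Placeholder file names; print the command rather than running it.
    print(" ".join(rnnlm_train_command(
        "corpus_unk.train", "corpus_unk.valid", "model.rnnlm")))
```

Calling train(...) with the same arguments starts the actual run once the toolkit has been compiled.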

References:
[1] Tomáš Mikolov, Anoop Deoras, Daniel Povey, Lukáš Burget, Jan Černocký. Strategies for Training Large Scale Neural Network Language Models. In: Proceedings of ASRU 2011.


Monday, July 9, 2012
