Six Degrees of Separation: 2012

Sunday, July 15, 2012

Public Radio Stations in Pittsburgh

88.3 FM - WRCT, Progressive Radio Pittsburgh PA and Public Radio
89.3 FM - WQED, Classical and Jazz Radio Stations
89.7 FM - KGBM, Pittsburgh Religious & Christian Radio Stations
90.5 FM - WDUQ, Classical and Jazz Radio Stations
91.3 FM - WYEP, Progressive Radio Pittsburgh PA and Public Radio
92.1 FM - WPTS, Progressive Radio Pittsburgh PA and Public Radio
92.9 FM - WLTJ, Light (Lite) Rock / Soft Rock Radio Stations in Pittsburgh PA
93.7 FM - WRKZ - Various Music
94.5 FM - 3WS, Greatest Hits of the 1960's and 1970's
96.1 FM - WPHH, Light (Lite) Rock / Soft Rock Radio Stations in Pittsburgh PA
96.9 FM - Bob, They play anything
99.7 FM - WISH, Light (Lite) Rock / Soft Rock Radio Stations in Pittsburgh PA
100.7 FM - Star, Pittsburgh Best Variety Music
101.5 FM - Word, Pittsburgh Religious & Christian Radio Stations
102.5 WDVE - 102.5 FM - Steelers Broadcasts, Rock Radio Stations
104.7 FM - News Talk WPQB
105.9 The X - WXDX 105.9FM - Includes Penguins Hockey Broadcasts
108 FM - Country Radio
590 AM - WMBS, Adult Standards
970 AM - ESPN, Pittsburgh - Sports Radio Pittsburgh PA
730 AM - WPIT, Pittsburgh Religious & Christian Radio Stations
1250 AM - WTAE / WEAE, Sports Radio Pittsburgh PA
1320 AM - WJAS, Nostalgia Music Station Pittsburgh

Monday, July 9, 2012

Train a Recurrent Neural Network Language Model with Limited-sized Vocabulary

Mikolov et al. have shown that neural network based language models can improve speech recognition results [1]. They have also published an open-source toolkit for training an RNNLM (Recurrent Neural Network Language Model). This post explains how to use it to train a model with a limited vocabulary. A smaller vocabulary makes the model much faster to train and use, and uncommon words carry little information about a sentence anyway.

Here is a step-by-step guide to training the model.

First, prepare a corpus. See my previous post for how to extract a corpus from Wikipedia.

Second, download the RNNLM toolkit from the RNNLM page.

Third, replace uncommon words in the corpus with "<unk>". tounk.py is a Python script that does this. It takes two arguments: the first is the path to the text corpus; the second is the path to the vocabulary file you want to use. The result is written to stdout. A typical invocation of tounk.py is:
$ python tounk.py $corpus $vocab > $corpus_unk
where $corpus is the filename of your corpus, $vocab is the filename of the vocabulary, and $corpus_unk is the output filename.
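
For readers who want to see the idea behind this step, here is a minimal sketch of the replacement logic. It is not the actual tounk.py; the function names load_vocab and to_unk are made up for illustration, and it assumes a whitespace-tokenized corpus and a vocabulary file with one word per line.

```python
def load_vocab(lines):
    """Collect vocabulary entries (assumed: one word per line), skipping blanks."""
    return {line.strip() for line in lines if line.strip()}


def to_unk(sentence, vocab, unk="<unk>"):
    """Replace every whitespace-separated token not in vocab with the unk token."""
    return " ".join(word if word in vocab else unk for word in sentence.split())


# Small demonstration with an in-memory vocabulary instead of a file.
vocab = load_vocab(["the\n", "cat\n", "sat\n", "\n"])
print(to_unk("the cat sat on the mat", vocab))  # prints: the cat sat <unk> the <unk>
```

Note that with this sketch any token missing from the vocabulary is replaced, so sentence markers such as "<s>" would have to be listed in the vocabulary if the corpus contains them.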

Lastly, train the model. The script run_rnnlm.sh shows an example.

References:
[1] Tomáš Mikolov, Anoop Deoras, Daniel Povey, Lukáš Burget, Jan Černocký: Strategies for Training Large Scale Neural Network Language Models. In: Proceedings of ASRU 2011.

Monday, July 2, 2012

Language Modeling for Sphinx4

To perform speech recognition with Sphinx4, you need a language model.
This post explains how to train a language model (LM) for Sphinx4 using the cmuslmt toolkit.

1) Prepare Corpus

You can use any free text. In this post, we will use Wikipedia. You can download and process the Wikipedia corpus by following the steps described in http://trulymadlywordly.blogspot.com/2011/03/creating-text-corpus-from-wikipedia.html
However, a few changes are needed.
First, only the first three steps are needed: get the dump file, convert the dump file to sentences, and convert the sentence list to a corpus file. The remaining steps will be done with cmuslmt.
Second, you need to add "<s>" at the beginning of each sentence and "</s>" at the end. Here is the script I wrote for this task: wikireader.py. It also splits the corpus into three parts: train, valid, and test. Only the "train" set is used to train the language model; the "test" set is used to measure how good the model is. In addition, since the whole Wikipedia set is very large, only 500M of it is used.
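
To make the second change concrete, here is a minimal sketch of adding the sentence markers and splitting the corpus. It is not the actual wikireader.py; the function names and the split ratio (8 of every 10 sentences to train, 1 to valid, 1 to test) are assumptions chosen only for illustration.

```python
def add_markers(sentence):
    """Wrap one sentence in the <s> ... </s> markers."""
    return "<s> " + sentence.strip() + " </s>"


def split_corpus(sentences, period=10):
    """Split marked sentences into train/valid/test.

    Assumed scheme: within each run of `period` sentences, the second-to-last
    goes to valid, the last goes to test, and the rest go to train.
    """
    train, valid, test = [], [], []
    for index, sentence in enumerate(sentences):
        marked = add_markers(sentence)
        slot = index % period
        if slot == period - 2:
            valid.append(marked)
        elif slot == period - 1:
            test.append(marked)
        else:
            train.append(marked)
    return train, valid, test


# Small demonstration on ten dummy sentences.
train, valid, test = split_corpus(["sentence %d" % i for i in range(10)])
print(len(train), valid, test)
```

A deterministic split like this keeps the example reproducible; a random split would work just as well as long as the three sets do not overlap.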

2) Download and install cmuslmt

a) Download cmuslmt and unzip it.
b) Follow the instructions in the README to install it (pay attention to the "little-endian" or "big-endian" setting).
c) Optionally, add the ../bin path to your PATH environment variable so that you can use the tools from other directories. Search online if you don't know how to edit PATH.

3) Train the model

The basic steps are:
a) Create a vocabulary using the text2wfreq and wfreq2vocab commands. This is needed because the number of distinct words in Wikipedia is so large;
b) Train an LM using text2idngram. You may need to use cutoffs because cmuslmt cannot support a "Size of trigram segment that is bigger than 65535";
c) Convert the model to DMP format (which is accepted by Sphinx4).
To simplify your work, I wrote a script that does all of this automatically; see create_lm.sh (you need to change the parameters, such as the path names).
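
To illustrate what step a) is doing conceptually, here is a minimal Python sketch that counts word frequencies and keeps only the most frequent words. It is not how cmuslmt implements text2wfreq and wfreq2vocab, and it is not part of create_lm.sh; the function name build_vocab and the max_size parameter are hypothetical.

```python
from collections import Counter


def build_vocab(sentences, max_size):
    """Return the max_size most frequent whitespace-separated words, sorted alphabetically."""
    counts = Counter(word for sentence in sentences for word in sentence.split())
    return sorted(word for word, _ in counts.most_common(max_size))


# Small demonstration: "a" occurs 3 times, "b" twice, "c" once.
print(build_vocab(["a b a", "c a b"], max_size=2))  # prints: ['a', 'b']
```

Limiting the vocabulary this way is the same idea as the cutoffs in step b): rare items are dropped so that the model stays small enough for the toolkit to handle.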

4) Apply it in Sphinx4

See the instructions provided by Sphinx4 at http://cmusphinx.sourceforge.net/sphinx4/doc/UsingSphinxTrainModels.html.