6 Forced Alignment

Once acoustic models have been created, Kaldi can also perform forced alignment on audio accompanied by a word-level transcript. Note that the Montreal Forced Aligner is a forced alignment system based on Kaldi-trained acoustic models for several world languages. You could also considering checking out FAVE for aligning American English speech.

Otherwise, if the audio to be aligned is the same as the audio used in the acoustic models, then the alignments can be extracted directly from the alignment files. If you have new audio and transcripts, then the transcript files will need to be updated before alignment.

The full procedure will convert output from the model alignment into Praat TextGrids containing the phone-level transcript.

If the data to be aligned is the same as the training data, skip to Section 6.4. Otherwise, you’ll need to update the transcript files and audio file specifications.

6.1 Prepare alignment files

To extract alignments for new transcripts and audio, you’ll need to create new versions of the files in the directory data/train. As a reminder, these files are text, segments, wav.scp, utt2spk, and spk2utt (see Section 5.2). We’ll house these in a new directory in mycorpus/data.

# create text, segments, wav.scp, utt2spk, and spk2utt

cd mycorpus/data
mkdir alignme         

6.2 Extract MFCC features

Revisit Section 5.7 on MFCC feature extraction for reference. You’ll need to replace data/train with the the new directory, data/alignme.

cd mycorpus
                    
mfccdir=mfcc
for x in data/alignme
do
    steps/make_mfcc.sh --cmd "$train_cmd" --nj 16 $x exp/make_mfcc/$x $mfccdir
    utils/fix_data_dir.sh data/alignme
    steps/compute_cmvn_stats.sh $x exp/make_mfcc/$x $mfccdir
    utils/fix_data_dir.sh data/alignme
done

6.3 Align data

Revisit Section 5.9 on triphone training and alignment for reference. Select the acoustic model and corresponding alignment process you’d like to use. You’ll need to replace data/train with the the new directory, data/alignme. As an example:

cd mycorpus
steps/align_si.sh --cmd "$train_cmd" data/alignme data/lang \
exp/tri4a exp/tri4a_alignme || exit 1;

6.4 Extract alignment

  • Obtain CTM output from alignment files

CTM stands for time-marked conversation file and contains a time-aligned phoneme transcription of the utterances. Its format is:

utt_id channel_num start_time phone_dur phone_id

To obtain these, you will need to decide which acoustic models to use. The following code will extract the CTM output from the alignment files in the directory tri4a_alignme, using the acoustic models in tri4a:

cd mycorpus
                    
for i in  exp/tri4a_alignme/ali.*.gz;
do src/bin/ali-to-phones --ctm-output exp/tri4a/final.mdl \
ark:"gunzip -c $i|" -> ${i%.gz}.ctm;
done;
  • Concatenate CTM files
cd mycorpus/exp/tri4a_alignme
cat *.ctm > merged_alignment.txt
  • Convert time marks and phone IDs

The CTM output reports start and end times relative to the utterance, as opposed to the file. You will need the segments file located in either data/train or data/alignme to convert the utterance times into file times.

The output also reports the phone ID, as opposed to the phone itself. You will need the phones.txt file located in data/lang to convert the phone IDs into phone symbols.

An example script to accomplish this can be downloaded here: id2phone.R

After obtaining the segments and phones.txt files, run id2phone.R to convert phone IDs to phones characters and map utterance times to file times. You will need to modify the file locations and possibly the regular expression to obtain the filename from the utterance name. Recall that the CTM output lists the utterance ID whereas the segments file lists the file ID. (If you named things logically, the file ID should be a subset of the utterance ID.)

id2phone.R returns a modified version of merged_alignment.txt called final_ali.txt

  • Split final_ali.txt by file

An example script to accomplish this can be downloaded here: splitAlignments.py

final_ali.txt contains the phone transcript for all files together. This can be split into unique files by running splitAlignments.py. You will need to modify the location of final_ali.txt in this script.

python splitAlignments.py
  • Create word alignments from phone endings

First we’ll need to use the [B I E S] suffixes on the phones in order to group phones together into word-level units.

Run phons2pron.py to complete this step. Note that I have utf-8 character encoding on this script. If necessary, this can be updated to reflect the character encoding that best matches your files.

Second, we’ll need to match the phone pronunciation to the corresponding lexical entry using lexicon.txt.

Run pron2words.py to complete this step.

6.5 Create Praat TextGrids

  • Append header to each of the text files for Praat

Praat requires that a text file have a header. Once we append the header, then we can convert these text files into TextGrids. The following code requires a text file containing the header:

file_utt file id ali startinutt dur phone start_utt end_utt start end

It also requires a tmp directory for processing. I put this on my Desktop.

cd ~/Desktop
mkdir tmp
                
header="/Users/Eleanor/Desktop/header.txt"
                
# direct the terminal to the directory with the newly split session files
# ensure that the RegEx below will capture only the session files
# otherwise change this or move the other .txt files to a different folder
                
cd mycorpus/forcedalignment
for i in *.txt;
do
    cat "$header" "$i" > /Users/Eleanor/Desktop/tmp/xx.$$
    mv /Users/Eleanor/Desktop/tmp/xx.$$ "$i"
done;
  • Make Praat TextGrids of phone alignments from .txt files

createtextgrid.praat will read in the new phone transcripts and corresponding audio files to create a TextGrid for that file. You will need to modify the locations of the phone transcripts and audio files.

  • Make Praat TextGrids for word alignments from word_alignment.txt

An example script to accomplish this can be downloaded here: createWordTextGrids.praat

  • Stack phone and word TextGrids

stackTextGrids.praat